Abstract

This paper presents Haskell#, a coordination language targeted at the
efficient implementation of parallel scientific applications on loosely
coupled parallel architectures, using the functional language Haskell. Its
programming environment encompasses an editor, a compiler into Petri nets,
a Petri net animator and proof tool, and a skeleton library. Examples of
applications, their implementation details and performance figures are
presented.



1 Introduction 

The peak performance of parallel architectures is growing at a faster pace
than predicted by Moore's law, which states that every 18 months computer
hardware becomes twice as fast and halves in price. However, parallel
programming tools have not been able to reconcile portability, scalability
and a higher level of abstraction without imposing severe performance
penalties on applications [28].

The emerging technologies of the 1990s gave birth to new challenges in
high-performance computing. The advent of clusters [?], low-cost
supercomputers built on top of networks of workstations and personal
computers, disseminated supercomputing among academic institutions,
industries and companies [?] [?] [19] [20]. More recently, advances in wide
area network interconnection technologies have made it possible to use
their infrastructure to build distributed supercomputers of virtually
infinite scale, the grids, which are particularly suitable for addressing
very coarse-grained scientific computing applications. Great efforts are
being made to make these technologies viable, with promising results [?].

Clusters and grids sparked a myriad of new applications in supercomputing
for scientific computation. Most of them are not addressed adequately by
contemporary tools, yielding inefficient distribution of parallel programs
[83]. In [10], some parallel programming approaches used in scientific
computing are compared with respect to scalability (efficiency), generality
and abstraction. MPI (Message Passing Interface) [63], the most widespread
message passing library, provides scalability and generality, but is less
abstract than TCE (Tensor Contraction Engine) [7], PETSc [5], GA (Global
Array) [?], OpenMP [67], auto-parallelized C/Fortran90 and HPF (High
Performance Fortran) [32]. PETSc and TCE are special-purpose libraries for
scientific computing, providing a high level of abstraction and
scalability. Implicit approaches, such as auto-parallelized C/Fortran90,
present low scalability, a high level of abstraction and high generality.
These observations illustrate that, despite the efforts of the last decade,
building new parallel programming environments that reconcile a high level
of abstraction, modularity and generality with scalability and peak
performance is still a challenge [28] [?] [?].

This paper presents Haskell#, a process-oriented coordination language
[53] where Haskell [75], a language considered a de facto standard in lazy
functional programming, is used for programming at the computation level.
Haskell# aims to provide high-level programming mechanisms without
sacrificing performance significantly, by minimizing the overheads of the
management of parallelism. One of the most important concerns in Haskell#
is to make it easier to prove the correctness of programs. To that end, a
divide-and-conquer approach was adopted to increase the chances of formally
analyzing programs: the process network is completely orthogonal to the
sequential blocks of code (process functionality). Haskell allows
sequential programs to be proved correct in a simpler fashion than their
equivalents written in languages which belong to other programming
paradigms. The communication primitives were designed in such a way as to
allow their translation into Petri nets [72], a well-reputed formalism for
the specification of concurrent systems, with several analysis and
verification tools [80] [12] available.

Haskell# emphasizes compositional programming and provides support for
skeletons [25]. Skeletons are used to expose topological information that
can guide the Haskell# compiler in the generation of more efficient code.
MPI (Message Passing Interface) [21] is used to manage parallelism without
requiring any run-time support. Owing to the recent development of
interoperable [84] and grid-enabled [49] versions of MPI, Haskell# programs
may be executed on grids without any extra burden.

Examples of benchmark programs and their performance figures are provided,
elucidating the most important aspects of programming in Haskell#.

This paper comprises five other sections. Section 2 gives background for
programming in Haskell#, focusing on programming abstractions. Section 3
presents motivating application examples of Haskell# programming. Section 4
presents details about the current implementation of Haskell# for clusters.
Section 5 presents performance figures for the applications presented in
Section 3, running on the implementation described in Section 4. Section 6
concludes this paper, outlining the work in progress with Haskell#.

2 Programming in Haskell#

Haskell# programs are composed from a set of components, each one
describing an application concern. Concerns may be functional or
non-functional. Examples of functional concerns are the calculation of an
exact solution for a system of linear equations and the calculation of a
finite-difference approximation for a system of partial differential
equations. An example of a non-functional concern is the allocation of
processes to processors. Components may be reused among Haskell# programs.

In Haskell# programming, the process of composing components is inductive.
Simple components, functional modules implemented in Haskell, are the basic
building blocks. Given a collection of components, simple or composed, it
is possible to define a new composed component by specifying their
composition through the Haskell# Configuration Language (HCL). The result
of this process is a hierarchy of components, where the main component,
describing the application functionality, is at the root. Components at the
leaves are simple components (always addressing functional concerns) and
intermediate nodes are composed components.

From the perspective of process-oriented coordination models [33], the
collection of functional modules of a Haskell# program forms a computation
medium, while the collection of composed components forms a coordination
medium. The concerns on the parallel composition of Haskell functional
computations are resolved entirely, and only, at the coordination level.
The use of Haskell for programming the computation medium allows the
coordination and computation languages to be truly orthogonal. Lazy lists
allow the overlapping of communication and computation in process
execution, without the need to embed coordination extensions in the code of
the functional modules.

The idea of hierarchical compositional languages implemented using
configuration languages is not recent [13] [?]. Haskell# differs from its
predecessors in its support for skeletons, obtained by allowing the concern
addressed by a component to be partially parameterized, and in its ability
to overlap them, which makes it possible to encapsulate cross-cutting
concerns [?]. The use of skeletons has gained the attention of the parallel
programming community since the last decade and is now supported by many
languages and paradigms [79]. The problem of modularizing cross-cutting
concerns has gained attention in the software engineering research
community, particularly for programming large-scale object-oriented
systems. An example of a cross-cutting concern is the validation procedures
executed by processes for accessing computational resources in a grid
environment. With respect to this feature, Haskell# may implement the
notions of AOP (Aspect Oriented Programming) [52] and Hyperspaces [68]
using a unified set of language constructors. Skeletons may be overlapped,
forming more complex ones.

Haskell# programs may be translated into Petri nets. This makes it possible
to prove formal properties and to evaluate the performance of parallel
programs using automatic tools. Previous work has addressed the problem of
translating Haskell# programs into Petri nets [56] [23]. The expressive
power of HCL for describing patterns of interaction among processes is
equivalent to the descriptive power of labelled Petri nets [?].

Relevant details about how Haskell# programs are implemented are presented
next. HCL abstractions for programming at the coordination medium are
informally introduced and it is shown how simple components are programmed
in Haskell. Motivating examples of Haskell# are presented in Section 3,
illustrating the use of Haskell# programming abstractions. Appendix B
formalizes an algebra for describing the semantics of Haskell# programming
abstractions. The informal description points at the corresponding Haskell#
algebraic constructions.

Figure 1: Process Network of MCP-Haskell#

2.1 Programming Composed Components 

Composed components, which form the coordination medium of Haskell#
programs, are programmed in HCL configurations. HCL programming
corresponds to the inductive step of the Haskell# programming task
described in the previous section. In what follows, the constructors used
at the coordination level for programming Haskell# applications are
informally introduced. Their formal syntax is presented in Appendix A.
Appendix B gives their algebraic semantics.

01. component MCP<n,m> with
02.
03. iterator i range [1,n]
04.
05. use Skeletons.{PipeLine, Workpool}
06.
07. interface IProbDef () -> (user_info, particles, tally_entries, recip, avg_e, all_tallies) where: IDispatcher () -> particles
08.    behavior: seq { recip!; avg_e!; all_tallies!; tally_entries!; user_info!;
09.                    repeat particles! until particles }
10.
11. interface ITracking (user_info, particles*) -> (events*, totals) where: IPipeStage particles -> events
12.    behavior: seq { user_info?; do process_particles; totals! }
13.
14. interface ITallying (tally_entries, events*) -> tallies* where: IPipeStage events -> tallies
15.    behavior: seq { tally_entries?; do process_events }
16.
17. interface IStatistics (avg_e, recip, totals, tallies*) -> () where: ICollector tallies -> ()
18.    behavior: seq { avg_e?; recip?; all_tallies?; repeat tallies? until tallies; totals? }
19.
20. unit pp; assign PipeLine<2> to pp
21. unit wp; assign Workpool<n> to wp
22.
23. unit prob_def # IProbDef wire tally_entries all*2: distribute;   assign ProbDef to prob_def
24. unit track # ITracking                                           assign Tracking to track
25. unit tally # ITallying                                           assign Tallying to tally
26. unit statistics # IStatistics                                    assign Statistics to statistics
27.
28. factorize wp.manager in -> out to dispatcher # () -> out, collector # in -> ()
29.
30. replace dispatcher # tallies -> particles by prob_def # (_,particles,_,_,_,_)
31. replace pp.stage[1] # particle -> events by track # (_,particles) -> (events,_)
32. replace pp.stage[2] # events -> tallies, by tally # (_,events) -> tallies
33. replace collector # tallies -> (), by statistics # (_,_,_,tallies) -> () to manager
34.
35. replicate pp into n; [/ replace wp.worker[i] by pp[i] /]
36.
37. connect prob_def->user_info to tracking<-user_info, synchronous
38. connect prob_def->tally_entries[0] to tallies<-tally_entries, synchronous
39. connect prob_def->tally_entries[1] to statistics<-tally_entries, synchronous
40. connect prob_def->recip to statistics<-recip, buffered
41. connect prob_def->avg_e to statistics<-avg_e, buffered
42. connect tracking->totals to statistics<-totals, buffered
43.
44. replicate m statistics # (avg_e, recip, totals, tallies, tally_entries) -> ()
45.    adjust wire avg_e: broadcast, recip: broadcast, totals: {# (map.sum.transpose)
46.                tally_entries<>: distribute, tallies<>: broadcast

Figure 2: HCL Code for MCP-Haskell#

MCP-Haskell#. A parallel version of MCP-Haskell [22] is used to exemplify
the syntax of HCL. MCP-Haskell [39] is a simplified sequential version of
MCNP, a code developed at Los Alamos over many years for simulating the
statistical behaviour of particles (photons, neutrons, electrons, etc.)
while they travel through objects of specified shapes and materials [15].
The HCL code of MCP-Haskell# is shown in Figure 2. The corresponding
network topology is presented in Figure 1. The parallelism is obtained from
three sources. Firstly, the tracking and tallying procedures are executed
concurrently using a pipeline. The second source, which is the main one,
comes from the fact that particles may be tracked and tallied
independently. To take advantage of this feature of the problem, a work
pool pattern of interaction is employed, where a manager process
distributes jobs (particles) to worker processes on demand, according to
their availability, and collects the results of each job. A third source of
parallelism comes from the fact that the statistics of different tallies
may be computed in parallel. Thus, each statistical process in the network
is responsible for computing a specified set of tallies. The following
sections explain how an HCL configuration may implement this network
topology.

An HCL configuration starts with a header, declaring the name of the
composed component, its static formal parameters and its arguments and
return points. MCP-Haskell# has two static parameters, m and n, which
control the number of parallel tasks, but no argument or return point is
defined. In general, arguments and return points are not defined for the
main component of an application. They are normally used in the
configuration of the encapsulated functional concerns.
2.1.1 The Basic Abstractions: Units and Channels

A Haskell# configuration is specified by a collection of units, which are
abstractions for agents that execute a particular task. Units synchronize
using communication channels. The task performed by a unit is defined
statically, by assigning a component to it. Units may be viewed as a "glue"
for composing components. Units have interfaces, comprising collections of
input and output ports. Interfaces are necessary for allowing units to be
connected through communication channels. An interface also describes a
partial order for the activation of ports during execution, characterizing
the behavior of a unit. A communication channel is defined by linking two
ports of opposite directions through a communication mode: synchronous,
buffered or ready. The communication modes of Haskell# channels correspond
directly to MPI primitives, ensuring their efficient implementation, and
are semantically equivalent to those of OCCAM [16] and CSP [43]. Ports
linked through a communication channel are said to be communication pairs.
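
The vocabulary of this section (units, ports, channels and communication
modes) can be summarized by a small data model. The Haskell sketch below is
only illustrative: the type and function names (Port, Channel, connect and
so on) are assumptions made for the example and are not part of HCL or of
the Haskell# implementation.

```haskell
-- Illustrative model only; these names are not part of HCL or Haskell#.
data Direction = Input | Output
  deriving (Eq, Show)

-- The three communication modes named in the text.
data Mode = Synchronous | Buffered | Ready
  deriving (Eq, Show)

-- A port belongs to a unit and has a direction.
data Port = Port
  { portUnit :: String
  , portName :: String
  , portDir  :: Direction
  } deriving (Eq, Show)

-- A channel links an output port to an input port through a mode.
data Channel = Channel
  { sender   :: Port
  , receiver :: Port
  , mode     :: Mode
  } deriving (Eq, Show)

-- Build a channel, rejecting links whose ports do not have the
-- expected opposite directions.
connect :: Port -> Port -> Mode -> Either String Channel
connect from to m
  | portDir from /= Output = Left (portName from ++ " is not an output port")
  | portDir to   /= Input  = Left (portName to ++ " is not an input port")
  | otherwise              = Right (Channel from to m)
```

For instance, connecting an output port user_info of prob_def to an input
port user_info of track in synchronous mode yields a Right value, in the
spirit of line 37 of Figure 2, while swapping the two ports yields a Left
value describing the error.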

In Figure 2, lines 20 to 26 contain declarations of units, whose
identifiers are placed after the keyword unit. The assign declarations bind
components to units. The interface of a unit is declared after the # clause.
In the example, an interface class identifier is employed, but it is also
possible to declare an interface directly. This topic is discussed further
in the next section.

The low level of abstraction provided by units, ports and channels is not
appropriate for programming large-scale and complex distributed parallel
programs. The next sections introduce additional abstractions intended to
raise the level of abstraction in HCL programming, simplifying the
specification of large-scale and complex process topologies. Essentially,
they provide support for partial topological skeletons.

2.1.2 Interface Classes 

Haskell# incorporates the notion of interface class for representing the
interfaces of units that present equivalent behavior. Examples of interface
class declarations are shown in lines 07 to 18 of Figure 2. The identifier
of an interface class is placed after the interface keyword. The notation
(i1, i2, ..., in) -> (o1, o2, ..., om) sets up n input ports and m output
ports, with the respective identifiers. In a where clause, an interface
composition operator (#) allows defining how an interface is obtained from
the composition of existing ones. The semantics of the # operator is
formalized in Appendix B.

Units that declare the same interface name after the # clause in unit
declarations inherit the same behavior, specified in the corresponding
interface declaration.

A small language is embedded in the behavior clause of interface
declarations, intended to describe partial orders in the activation of
ports. Its combinators are semantically equivalent to the operators of
regular expressions controlled by semaphores [17], which are regular
expressions enriched with an interleaving operator, represented in HCL by
the combinator par, and counter semaphore primitives, represented by the
primitives wait and signal. This feature ensures that the descriptive power
of HCL is equivalent to that of terminal labelled Petri nets in describing
the interaction patterns between processes.
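
One way to read a behavior expression is as the set of port-activation
orders it admits. The Haskell sketch below models only sequencing and
interleaving (seq and par); repetition and the wait and signal semaphore
primitives are deliberately left out. The Behavior type, the traces
function and the example expression are assumptions made for illustration
and do not reproduce the algebra of Appendix B.

```haskell
-- A port activation: "p!" sends through port p, "p?" receives from it.
data Action = Send String | Recv String
  deriving (Eq, Show)

-- Illustrative fragment of the behavior language: seq and par only.
data Behavior
  = Act Action
  | Seq [Behavior]
  | Par [Behavior]

-- All ways of merging two traces while keeping each trace's own order.
interleave :: [a] -> [a] -> [[a]]
interleave [] ys = [ys]
interleave xs [] = [xs]
interleave (x : xs) (y : ys) =
  map (x :) (interleave xs (y : ys)) ++ map (y :) (interleave (x : xs) ys)

-- Every activation order admitted by a behavior.
traces :: Behavior -> [[Action]]
traces (Act a)  = [[a]]
traces (Seq bs) = foldr (\b acc -> [t ++ u | t <- traces b, u <- acc]) [[]] bs
traces (Par bs) = foldr (\b acc -> concat [interleave t u | t <- traces b, u <- acc]) [[]] bs

-- Hypothetical interface behavior: receive user_info first, then activate
-- particles and events in either order, and finally send totals.
example :: Behavior
example =
  Seq
    [ Act (Recv "user_info")
    , Par [Act (Recv "particles"), Act (Send "events")]
    , Act (Send "totals")
    ]
```

Here traces example yields two activation orders, which differ only in
whether particles or events is activated first.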

2.1.3 Wire Functions 

In an assignment declaration, it is necessary to map the input and output
ports of the unit to the arguments and return points of the assigned
component, respectively. The notation (i1, i2, ..., in) -> (o1, o2, ..., om)
may be used whenever the order of ports does not match the order of the
corresponding arguments and return points.

In fact, the association between the input and output ports and the
arguments and return points of components in assign declarations defines
how Haskell# glues the coordination and computation media. Whenever an
argument is not bound to an input port, an explicit value must be provided
for it. Also, whenever a return point is not associated with an output
port, it is not evaluated.

In wire clauses of unit declarations, HCL allows the definition of a wire
function that maps a value received through an input port onto the value
that is passed to an argument. Analogously, it is possible to define a wire
function that takes a value produced at a return point and yields another
value, which is sent through the associated output port. This increases the
chances that a component can be reused without changing its internal
implementation in cases where the type or meaning of its arguments and
return points is incompatible with the type or meaning expected of the
input and output ports at the coordination level.
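
The role of wire functions can be pictured as plain function composition
around a component. The Haskell sketch below is a minimal illustration
under that assumption; withWires, scaleSamples and scaledTotal are
hypothetical names, and the component and its types were invented for the
example.

```haskell
-- Wrap a component (modelled as a plain function from its argument to its
-- return point) with an input and an output wire function.
withWires :: (portIn -> arg) -> (ret -> portOut) -> (arg -> ret) -> portIn -> portOut
withWires inWire outWire component = outWire . component . inWire

-- Hypothetical component: expects a fraction in [0, 1] and returns
-- scaled samples.
scaleSamples :: Double -> [Double]
scaleSamples fraction = map (* fraction) [10, 20, 30]

-- Hypothetical unit: its input port carries an integer percentage and its
-- output port a single total, so both sides need a wire function.
scaledTotal :: Int -> Double
scaledTotal = withWires percentToFraction sum scaleSamples
  where
    percentToFraction pct = fromIntegral pct / 100
```

The component scaleSamples is reused unchanged: the mismatch between what
the ports carry and what the component expects is absorbed entirely by the
two wire functions.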
2.1.4 Groups of Ports

Another useful feature of HCL is the replication of interface ports of a
unit, forming groups of ports where individual members are referenced using
enumeration indexes. A group of ports is treated as an individual entity
from the local perspective of the unit. Thus, it is bound to a unique
argument or return point and must be activated atomically. However, from a
global view, the individual ports of the group are treated separately, and
may be connected through different channels.

Groups of ports may be of two kinds: any or all. When a group of input
ports of kind all is activated, each port member must receive a value. The
array of values received is mapped to a unique value by a wire function.
Then, the value is passed to the argument mapped onto the group of ports.
When an output group of ports is activated, the value yielded at the return
point mapped to it is transformed, using a wire function, into an array of
values that are sent through the port members of the group. When a group of
ports of kind any is activated, one port belonging to the group is chosen
among the ports whose communication pair is activated. Once the port is
chosen, communication occurs as in individual ports. Notice that wire
functions are necessary for configuring groups of ports of kind all.
Because of that, groups of ports are configured in the wire clause of unit
declarations, as exemplified in Figure 2 with the tally_entries group of
ports. For configuring a group of ports of kind any, the any keyword is
used in place of the all keyword illustrated in the example. Figure [?]
illustrates the semantics of wire functions and groups of ports.
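
The behavior of groups of ports described above can be sketched with
ordinary list functions. The Haskell code below is an illustration under
stated assumptions: the round-robin meaning given to distribute and the
"first ready port" policy of receiveAny are guesses made for the example,
and sumColumns only follows the spirit of the (map.sum.transpose) wire
function in line 45 of Figure 2.

```haskell
import Data.List (transpose)

-- Output group of kind all: copy the value to each of the n member ports.
broadcast :: Int -> a -> [a]
broadcast n x = replicate n x

-- Output group of kind all: deal the elements of a list round-robin over
-- n member ports (one possible meaning of "distribute"; an assumption).
distribute :: Int -> [a] -> [[a]]
distribute n xs =
  [ [x | (i, x) <- zip [0 ..] xs, i `mod` n == k] | k <- [0 .. n - 1] ]

-- Input group of kind all: one value arrives per member port and a wire
-- function reduces the array to the single value passed to the argument.
sumColumns :: [[Int]] -> [Int]
sumColumns = map sum . transpose

-- Input group of kind any: choose the first member whose communication
-- pair is ready (Just v), returning its index and value.
receiveAny :: [Maybe a] -> Maybe (Int, a)
receiveAny ports =
  case [(i, v) | (i, Just v) <- zip [0 ..] ports] of
    []      -> Nothing
    (p : _) -> Just p
```

From the unit's local perspective each of these functions sees a single
value on one side and the whole array of member ports on the other, which
is what allows a group to be bound to a unique argument or return point.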

2.1.5 Stream Ports 

Stream ports allow the transmission of sequences of values (streams)
terminated by an end marker. Haskell# streams may be nested (streams of
streams) at arbitrary nesting levels, which must be statically configured.
The stream ports of units to which a simple component is assigned must be
mapped to arguments and return points of lazy list types in the functional
module. Nested streams are associated with nested lazy lists of at least
the same nesting level.

In the interface declarations in lines 11 and 14, stream ports may be
identified by the occurrence of sequences of "*" symbols after the
identifier of the port. The number of *'s indicates the nesting level. For
instance, the stream ports particles and events of interface ITracking have
nesting level equal to one. In Figure 9, where the Haskell code of the
functional module Tracking is presented, the arguments and return points
associated with the particles and events ports of the track unit are lazy
lists of nesting level greater than or equal to one. Stream ports are
essential for the expressiveness of Haskell#, since they allow
communication operations and computations to overlap during the execution
of processes. The same approach is used by other parallel functional
languages, such as Eden [?].
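
The reason lazy lists fit stream ports can be seen in plain Haskell, without
any coordination code. In the sketch below, two hypothetical pipeline stages
are ordinary list functions; Particle, Event and the stage bodies are
stand-ins invented for the example, not the MCP-Haskell# code.

```haskell
-- Hypothetical stand-ins for the real types: particles and events are Ints.
type Particle = Int
type Event    = Int

-- Tracking stage: each particle yields its own (lazy) list of events.
trackStage :: [Particle] -> [[Event]]
trackStage = map (\p -> [p, p + 1])

-- Tallying stage: consumes event lists as they become available.
tallyStage :: [[Event]] -> [Int]
tallyStage = map sum

-- Composition of the two stages over a stream of particles.
pipeline :: [Particle] -> [Int]
pipeline = tallyStage . trackStage

-- Only three tallies are demanded, so only three particles are ever
-- tracked, even though the input stream is unbounded.
firstTallies :: [Int]
firstTallies = take 3 (pipeline [1 ..])
```

Because each element is produced only when demanded, a consumer can start
working on the first elements of a stream before the producer has finished,
which is the overlap of communication and computation mentioned above.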

2.1.6 Configuring Arguments and Return Points of Composed 
Components 

Arguments and return points of composed components are, respectively,
input and output ports of units that are not connected through any
communication channel. For specifying the ports that must be connected to
arguments and return points, HCL supports bind declarations.

2.1.7 The Distinction Between Processes and Clusters 

It is convenient to distinguish between units associated with simple and
composed components. The former are called processes, while the latter are
called clusters. Processes are concrete entities and may be viewed as
agents that perform sequential computations programmed in Haskell. Clusters
are abstract entities and must be viewed as a parallel composition of
processes. The abstraction of clusters is essential for expressing
hierarchical parallelism. For example, in a constellation architecture^1, a
cluster must be associated with a multiprocessing node, in such a way that
its comprising processes are allocated to processors for shared-memory
parallel execution. Instead of generating MPI code, the Haskell# compiler
could generate OpenMP [67] code for implementing communication among the
processes inside the cluster, which is more appropriate for
multiprocessors. The support for multiple hierarchies of parallelism is
essential for grid computing architectures [?] and is recognized as an
important challenge for designers of parallel programming languages [9].

In the MCP-Haskell# specification, pp and wp are clusters, units
respectively associated with the composed components PipeLine and Workpool,
which represent skeletons. The units prob_def, tally, track and statistics
are declared as processes. The components assigned to these units are
functional modules, written in Haskell.

2.1.8 Termination of Haskell# Programs

Units may be declared as repetitive or non-repetitive. Non-repetitive
units perform a task and go to their final state, while repetitive ones
always go back to their initial state, to execute their task once more. In
HCL, a unit is declared repetitive by placing a "*" symbol after the
keyword unit in its declaration. For a cluster to be declared repetitive,
all units belonging to the composed component assigned to it must be
repetitive. Otherwise, an error is detected and reported by the HCL
compiler.

A Haskell# program terminates whenever all non-repetitive units belonging
to its main component terminate. If it has only repetitive units, it does
not terminate. Repetitive units may be used to model reactive applications.

A non-stream port of a repetitive unit may be connected to a stream port of
a non-repetitive unit. Each value produced in the stream is consumed in one
execution of the repetitive process.
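
The termination rule and the restriction on repetitive clusters can be
written as two small predicates. The Haskell sketch below assumes a
simplified run-time view of units; UnitStatus and both function names are
hypothetical and serve only to restate the rules given in the text.

```haskell
-- Illustrative model of a unit's run-time status; names are assumptions.
data UnitStatus = UnitStatus
  { unitName   :: String
  , repetitive :: Bool  -- declared with "*" after the keyword unit
  , finished   :: Bool  -- has reached its final state
  }

-- A program terminates once all non-repetitive units of the main
-- component have terminated; with only repetitive units it never does.
programTerminated :: [UnitStatus] -> Bool
programTerminated units =
  not (null nonRepetitive) && all finished nonRepetitive
  where
    nonRepetitive = filter (not . repetitive) units

-- A cluster may be declared repetitive only if all units of the composed
-- component assigned to it are repetitive.
validRepetitiveCluster :: [UnitStatus] -> Bool
validRepetitiveCluster = all repetitive
```

Under this reading, a main component that mixes a finished non-repetitive
unit with a repetitive server-like unit is considered terminated, while a
main component made only of repetitive units never is.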

2.1.9 Virtual Units and The Support for Skeletons 

A skeleton was defined above as a composed component whose addressed
concern is partially defined or totally undefined. Now that the structure
of composed components has been examined, it is possible to define Haskell#
skeletons in more precise terms. In fact, the concern addressed by a
composed component is defined by the composition of the concerns addressed
by the components assigned to its comprising units. If some unit of a
component does not have a component assigned to it, it is said that the
component is partially parameterized by its addressed concern. This kind of
component is called a partial topological skeleton. Units not assigned to a
component are called virtual units. In other skeleton-based languages,
skeletons are usually total, in the sense that all units are virtual. After
instantiating a partial topological skeleton, or simply a skeleton, by
assigning it to a unit in the configuration of a component, it is possible
to assign components to the virtual units of the skeleton, configuring its
addressed concern.

^1 Constellations have been defined as clusters of multiprocessor nodes
with at least sixteen processors per node [?] [28].

Figure 5: Configuration of Units

The components PipeLine and Workpool are examples of total skeletons. They
are used for structuring the topology of the MCP-Haskell# program. They are
instantiated by assigning them to units pp and wp, respectively. The replace
declaration, exemplified in lines 30 to 33 of Figure 2, takes a virtual
unit from a skeleton and replaces it by another unit, such that there is a
homomorphism from the interface of the original unit to the interface of
the new unit. This is formalized in Appendix B by a pair of relations
between interfaces. Indeed, replace declarations are syntactic sugar in
HCL. The same effect could be obtained by creating a new unit, unifying it
with the skeleton unit and assigning the appropriate component to the
resulting unit. For that reason, replace declarations are not formalized in
Appendix B. This topic is revisited in the next section, where unification
is introduced.
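
The notions of virtual unit, total skeleton and assignment of components to
virtual units can be illustrated with an association list. The Haskell
sketch below is a deliberately simplified model: it ignores interfaces and
the homomorphism condition, and Skeleton, virtualUnits, isTotal, assign and
pipeLine2 are hypothetical names introduced for the example.

```haskell
-- A skeleton maps each unit name to the component assigned to it, if any.
-- Units mapped to Nothing are virtual. (Illustrative representation.)
type Skeleton = [(String, Maybe String)]

-- Names of the units that still have no component assigned.
virtualUnits :: Skeleton -> [String]
virtualUnits sk = [u | (u, Nothing) <- sk]

-- A skeleton is total when every one of its units is virtual.
isTotal :: Skeleton -> Bool
isTotal sk = length (virtualUnits sk) == length sk

-- Assign a component to a virtual unit; fail if the unit is unknown or
-- already has a component.
assign :: String -> String -> Skeleton -> Either String Skeleton
assign comp unit sk =
  case lookup unit sk of
    Nothing       -> Left ("unknown unit: " ++ unit)
    Just (Just _) -> Left ("unit already assigned: " ++ unit)
    Just Nothing  -> Right [(u, if u == unit then Just comp else c) | (u, c) <- sk]

-- Hypothetical two-stage pipeline skeleton, with both stages virtual.
pipeLine2 :: Skeleton
pipeLine2 = [("stage[1]", Nothing), ("stage[2]", Nothing)]
```

Assigning Tracking to stage[1] of pipeLine2 leaves stage[2] as the only
virtual unit, so the result is a partial rather than a total skeleton.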

2.1.10 Operations over Virtual Units and Overlapping of Skeletons

Two operations are useful for the specification of complex topologies
through the composition of skeletons: unification and factorization.
Unification substitutes a collection of virtual units by a new virtual
unit, obeying the network connectivity and behavior preserving restrictions
formalized in Appendix B. In this process, ports, individual or in groups,
may be grouped. Grouping groups of ports involves merging their sets of
ports. Factorization performs the inverse operation of unification. It
takes a unit and splits it into a collection of units, also respecting the
behavior and network connectivity preserving restrictions. It may be
necessary to replicate the communication pairs of interface ports of a
factorized unit in order to preserve network connectivity. Thus, it is also
necessary to configure wire functions whenever a new group of ports results
from a factorization. For that, HCL provides the adjust wire clause in
unification and factorization declarations.

Figure 6: An Illustrative Example of Unification/Factorization

In Figure 6, illustrative abstract examples of unification and
factorization are presented, illustrating the duality between these
operations. A more concrete example of factorization is presented in line
28 of Figure 2, where the manager unit from the Workpool skeleton is split
into the units dispatcher and collector, dividing the tasks performed by
the manager. Unification does not appear directly in the example of Figure
2. However, replace declarations, as discussed in the previous section, are
syntactic sugar in HCL that may be defined using unification. For instance,
consider the replace declaration in line 31. It can be rewritten using the
following equivalent code:

unit track' # ITracking

unify pp.stage[1] # particle -> events, track' # (user_info, particles) -> (events, totals)
to track # ITracking (user_info, particles) -> (events, totals)

assign Tracking to track






Figure 7: An Example of Replication 



Unification, and consequently replace declarations, allow skeletons to be
overlapped: units from distinct skeletons may be unified, forming a new
unit. Overlapping of skeletons is not supported by other skeleton-based
languages. In general, only nested composition has been addressed, and cost
models have been defined incorporating this feature [38]. A further step is
to define new cost models that incorporate the overlapping of skeletons.

2.1.11 Replicating Units 

Another useful feature of HCL is its support for replicating sub-networks
of the overall network of units described by the configuration. For that, a
collection of units to be replicated and a natural number greater than one
are provided. Network preserving restrictions must be observed, making it
necessary to replicate the communication pairs of the interface ports of a
replicated unit, as in factorization. Wire functions must be provided for
the resulting groups of ports using the adjust wire clause.

Replication is exemplified in line 35 of Figure 2. The unit pp is
replicated into n units (pp[i], 0 ≤ i ≤ n − 1), which replace the worker
units of the Workpool skeleton. Figure 7 presents an illustrative example.

2.1.12 Indexed Notation 

The Haskell# configuration language supports a special kind of syntactic
sugar that allows large collections of entities to be declared briefly. The
iterator declaration introduces one or more indexes and their ranges.
Syntactic elements that appear enclosed in [/ and /] delimiters (variation
scopes) are unfolded, according to the ranges of the indexes that appear in
their scope. The Haskell# compiler incorporates a pre-processor for
unfolding indexed notation.

In Figure 2, an iterator i is declared, varying from 1 to n. The replace
declaration in line 35 is placed in the context of a variation scope. Thus,
assuming that n = 3, it may be unfolded into the following code:

replace wp.worker[1] by pp[1]; replace wp.worker[2] by pp[2]; replace wp.worker[3] by pp[3]

Figure 8: Simple Components in Haskell
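
The unfolding of a variation scope amounts to instantiating a template once
per value of the iterator. The Haskell sketch below shows this for the
declaration in line 35 of Figure 2; unfold and replaceWorker are
hypothetical names, and the actual pre-processor of the Haskell# compiler
is not described here, so this is only one way the step could be written.

```haskell
-- Unfold a declaration template over the inclusive range of an iterator.
unfold :: (Int -> String) -> Int -> Int -> [String]
unfold template lo hi = map template [lo .. hi]

-- The body of the variation scope in line 35 of Figure 2, as a template
-- over the iterator i.
replaceWorker :: Int -> String
replaceWorker i =
  "replace wp.worker[" ++ show i ++ "] by pp[" ++ show i ++ "]"
```

With n = 3, unfold replaceWorker 1 3 produces the three replace
declarations listed above, one for each value of i.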



2.2 Programming Simple Components 

Simple components are programmed in standard Haskell. No extensions
to Haskell are necessary for gluing functional modules to the coordination
medium. They are connected to units of the coordination medium by
assignment declarations, which define a mapping between the ports of the
unit interface and the argument/return points of the component. The
arguments of a functional module are the arguments of its function named
main, while the return points are the elements of the tuple it returns. The
general signature of main is shown in Figure 8.

The main function may return values in the IO monad ^U\, but I/O
concerns may instead be resolved at the coordination level, using a skeleton
that implements an I/O aspect, an example of a cross-cutting non-functional
concern.

Figure 9 presents the Haskell code for the functional module Tracking
of MCP-Haskell#. Notice the correspondence of the arguments and return
points with the ports of the unit track. Functional modules are programmed
in pure Haskell: the computation code makes no reference to any
element declared at the coordination level of abstraction. Other examples
of functional modules exhibiting these characteristics are provided in
Figure 13.

module Tracking(main) where

import Track
import Tallies
import Mcp_types

main :: User_spec_info -> [(Particle, Seed)] -> ([[Event]], [Int])
main user_info particle_list =
    let event_lists = map f particle_list
    in (event_lists, tally_bal event_lists)
  where
    f (particle@(_,_,_,e,_), sd) = (Create_source e) : (track user_info particle [] sd)

Figure 9: A Functional Module from MCP-Haskell#

2.3 Haskell# in the Parallel Functional Languages Context

Some authors have written papers on the evolution of parallel functional
languages [571 SB EZ]. It is convenient to analyze the evolution of parallel
functional programming by dividing it into two periods [57]. In the first one,
the 1970s and 1980s, parallelism was viewed as a possibility to make
functional programs run faster. After that period, functional programming
techniques have been viewed as a promising alternative for promoting
higher-level parallel programming, mainly motivated by the use of skeletons
implemented using higher-order functions [25].

The first attempts to embed support for parallelism in functional
languages suggested the technique of evaluating function arguments in parallel,
with the possibility of functions absorbing unevaluated arguments and
perhaps also exploiting speculative evaluation [16]. However, the granularity
of the parallelism obtained from referential transparency in pure functional
languages is too fine to yield good performance on distributed
architectures. Techniques for controlling granularity, either statically or
dynamically, had little success in practice [HI [73l [50]. Implicit
parallelizing compilers have difficulty achieving good load balancing amongst
processors and keeping communication costs low. On the other hand, explicit
parallelism with annotations to control the demand for the evaluation of
expressions, the creation/termination of processes, the sequential and parallel
composition of tasks, and the mapping of these tasks onto processors has
yielded better results [m EH HSl |76l [m [86]. GpH adopts a semi-explicit
approach, where programmers may annotate the code, but the decision of when
to evaluate expressions in parallel is left to the compiler. Explicit
approaches have the disadvantage of cross-cutting the computation and the
communication code, which prevents reasoning about these elements in
isolation. Skeleton-based approaches have obtained relative success in
parallel functional programming [261 Ha Ea HQ].

The coordination paradigm influenced the design of parallel
functional languages in the 1990s, being exploited from two perspectives. In
the first one, it is used for abstracting parallel concerns from the
specification of computations. Eden [H], Caliban [85], and Haskell# focus on
these ideas. In the second one, a higher-order and non-strict style of
functional programming has been seen as a convenient way of specifying the
coordination amongst tasks. SCL pS] and Delirium [BTj are examples of
languages that employ the functional paradigm at the coordination level,
describing computations using languages from other paradigms. Haskell# has
other similarities with Eden and Caliban besides adopting the coordination
paradigm and Haskell for describing computations. They all use constructors
for the explicit specification of network topologies where processes
communicate through point-to-point, unidirectional channels. Like Eden,
Haskell# employs lazy lists for interleaving computation and communication
and is strict in communication. Higher-order values cannot be transmitted
through channels. Eden includes functionality for specifying dynamic
topologies, contrary to Caliban and Haskell#. Static parallelism is an
important premise of the Haskell# design, since Haskell# programs are intended
to be analyzed by translating them into Petri nets. Also, Haskell# is oriented
towards high-performance computing, where static parallelism is a reasonable
assumption, and not towards general concurrency. In the next paragraphs, some
important distinguishing features of Haskell# are discussed.

The adoption of a configuration-based approach for coordination.
A configuration language integrated with a lazy functional language like
Haskell allows a complete separation between the parallelism and computation
programming dimensions. No extensions to Haskell are required for
programming at the computation level. Haskell and HCL are orthogonal. Eden and
Caliban, examples of embedded coordination languages, extend Haskell
syntax with primitives for "gluing" processes to the coordination medium. GpH
tries to separate parallel coordination code by using evaluation strategies
[88]. Evaluation strategies are an interesting idea, but after inspecting some
GpH programs that use them, we noticed that a complete and transparent
separation of parallelism and computation is very difficult to obtain.
This is even worse when programmers want to reach peak performance of
applications at any cost. The experience with Haskell#, and other parallel
functional languages, has shown that a really transparent separation makes
it easier to parallelize existing Haskell programs. This increases
opportunities for code reuse and allows independent specification and
development of functional modules and coordination code, reducing programming
effort and costs. The ability to compose programs from parts using the
configuration approach also makes Haskell# more suitable for programming
large-scale high-performance applications than other parallel functional
languages [23]. Programmers are forced to adopt a coarse-grained view of
parallelism that is convenient for clusters and grids.

The modelling of parallel architectures. Developing general techniques
for freeing programmers from making decisions on the allocation of processes
to the processing nodes of a parallel architecture is an old challenge for the
parallel programming community. However, this problem is hard to treat
in general. Existing mechanisms for this purpose, either dynamic or static,
apply efficiently only to restricted instances of the general problem, and
some of them are based on heuristics. With the advent of grids, clusters of
heterogeneous nodes, constellations, etc., a unified approach covering all
realistic cases is not expected to appear. Because of that, Haskell# follows a
static and explicit approach to process allocation, as in Caliban. Eden and
GpH, on the other hand, leave allocation decisions to the compiler. In
Haskell#, it is possible to model both the needs of processes for optimal
execution and the characteristics of the architecture by using partial
topological skeletons that treat allocation as an aspect. Each skeleton may be
implemented using specific allocation policies convenient for different
architectures.

The analysis of formal properties using Petri nets. The support for
proving and analysing formal properties of parallel programs by using
Petri nets is one of the most important premises that guided the design of
Haskell#. A compiler that translates HCL configurations into INA [SU], a
Petri net analysis tool, was developed [55]. In [23], a new translation schema
incorporating some extensions to the original HCL was presented. Recently,
a new translation schema has appeared and we are working on a new compiler
for translating Haskell# programs into PNML [HI], a format supported by
many Petri net analysis tools, and SPNL |3B], for analysing the performance
of Haskell# programs by using stochastic Petri nets. TimeNET [92] will be
used for this purpose. Other parallel functional languages do not support
formal analysis of parallel programs.

Simple and portable implementations. Unlike other parallel
functional languages, it was not necessary to modify or extend the run-time
system of GHC for implementing Haskell#. Indeed, any Haskell compiler could
be used as an alternative to GHC, with all optimizations enabled. Haskell#
programs take advantage of the evolution of compilation techniques with little
effort. Eden, for example, modifies the GHC compiler and disables some of its
optimizations [69]. Modifications to the run-time system of the Haskell
compiler make it difficult to adapt the parallel language extension to new
versions of the compiler. In Haskell#, internal changes to the GHC run-time
system do not require modifications to the code generated by the Haskell#
compiler. Minor modifications are necessary only if the interface of some
library in use changes. GpH and Eden developers must also carefully analyze
the effects of modifications on their parallel run-time systems.

Efficiency. Potentially, the Haskell# compiler may generate efficient MPI code
without using advanced compilation techniques for parallel code. This is due
to the direct correspondence between HCL constructors and MPI primitives and
the use of skeletons to abstract specific interaction patterns. Languages that
use higher-level constructors, in the sense that parallelism is transparent or
implicit, have difficulty generating MPI code able to
take advantage of peak performance in cluster architectures and, mainly, in
grid computing environments.
Figure 10: HCL Topologies For Matrix Multiplication Solutions

3 Motivating Examples 

This section presents Haskell# implementations of three applications
recently used for benchmarking the parallel functional languages Eden, GpH
and PMLS: Matrix Multiplication, LinSolv and Ray Tracer [59]. A Haskell#
implementation of a subset of NPB (NAS Parallel Benchmarks) [2] is also
presented. These applications are used, in a separate section, for the
performance evaluation of the current Haskell# implementation, which is also
presented in its own section.



3.1 Matrix Multiplication 

Given two square matrices $A, B \in \mathbb{Z}^{n \times n}$, $n \in \mathbb{N}$, a matrix
$C \in \mathbb{Z}^{n \times n}$ is calculated, such that
$C_{i,j} = \sum_{k=1}^{n} A_{i,k} \cdot B_{k,j}$.

A trivial, fine-grained, parallel solution requires n x n processors. Each
processor computes an element $C_{i,j}$ from the scalar product of row i of A and
column j of B. This solution is obviously impractical, since large matrices are
common in real applications, requiring a number of processors not supplied
by contemporary parallel architectures. Three approaches are commonly 
used in order to aggregate computations for increasing granularity [5^ : 

• Row Clustering: each process computes a set of rows of C. For that, 
the process needs the corresponding set of rows of A and all matrix B; 
{- In FILE: ManagerSkelBC.WP.hcl -}
configuration IMakagerSkel a (t),c) where 

    use ReadMatrix, WriteMatrix -- functional modules

    unit rA # () -> (a::VMatrix); assign ReadMatrix to rA
    unit rB # () -> (b::VMatrix); assign ReadMatrix to rB
    unit wC # c::Matrix -> (); assign WriteMatrix to wC

{- In FILE: MatMultBC.WP.hcl -}
configuration MatMult<N> where

    iterator i range [1,N]

    use Skeletons.Workpool
    use MatrixMult, ManagerSkel -- MatrixMult is a func. module
    import MatrixMult_WF(splitM, combineM)

    unit wa; assign Workpool<N> to wa
    unit wb; assign Workpool<N> to wb

    unify wa.manager # c -> a, wb.manager # c -> b
       to manager # c -> (a, b)

    [/ unify wa.worker[i] # a -> c, wb.worker[i] # b -> c
          to worker[i] # (a::VMatrix, b::VMatrix) -> c::Matrix
       assign MatrixMult to worker[i] /]

    assign ManagerSkel to manager

{- In FILE: MatMultBC.Farm.hcl -}
configuration MatMult<N> where

    iterator i range [1,N]

    use Skeletons.Farm
    use ReadMatrix, MatrixMult, WriteMatrix -- functional modules
    import MatrixMult_WFs(splitM, splitM_T, combineM)

    unit farm_a;
    assign Farm<N, splitM, combineM> to farm_a
    unit farm_b;
    assign Farm<N, splitM_T, combineM> to farm_b

    [/ unify farm_a.worker[i] # a -> c,
             farm_b.worker[i] # b -> c
          to worker[i] # (a::VMatrix, b::VMatrix) -> c::Matrix
       assign MatrixMult to worker[i]
    /]

    unify farm_a.collector # c -> (),
          farm_b.collector # c -> ()
       to collector # c::Matrix -> ()

    assign WriteMatrix to collector

    assign ReadMatrix N -> a to farm_a.distributor # () -> a
    assign ReadMatrix N -> b to farm_b.distributor # () -> b

Figure 11: Haskell# Configuration of Block Clustering using Workpool
and Farm

• Block Clustering: each process computes a block of the resulting
matrix C. For that, the corresponding rows of A and columns of B are
needed;

• Gentleman's algorithm: the processes are organized in a torus
(circular mesh) for performing a systolic computation [78]. Each process
computes a block of C. In the initial state, the corresponding blocks of
A and B are arranged across the processes. Then, they execute k steps,
where k is the number of rows and columns of processes. At each step,
a process sends the blocks of A and B that it holds to its left
and down neighbors, and receives new blocks from its right and top
neighbors. A local computation is performed and the resulting matrix is
accumulated.
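The aggregation idea behind row clustering can be sketched sequentially in Haskell. This is a minimal illustration, not the code of the Haskell# components: the names `matMul`, `chunksOf` and `rowClustering` are assumptions introduced here, and each "job" is simply evaluated in turn rather than sent to a process.

```haskell
import Data.List (transpose)

type Matrix = [[Integer]]

-- Sequential product, following C[i][j] = sum over k of A[i][k] * B[k][j].
matMul :: Matrix -> Matrix -> Matrix
matMul a b = [[sum (zipWith (*) row col) | col <- transpose b] | row <- a]

-- Split a list into consecutive chunks of at most k elements (assumes k >= 1).
chunksOf :: Int -> [a] -> [[a]]
chunksOf _ [] = []
chunksOf k xs = let (h, t) = splitAt k xs in h : chunksOf k t

-- Row clustering: each job is a set of k rows of A and needs all of B.
-- The partial results are sets of rows of C, concatenated in order.
rowClustering :: Int -> Matrix -> Matrix -> Matrix
rowClustering k a b = concat [matMul rows b | rows <- chunksOf k a]
```

Whatever the chunk size k, `rowClustering k a b` equals `matMul a b`; only the granularity of the jobs changes, which is the property that the clustering approaches exploit.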

The above solutions differ in the number and size of messages exchanged.
In Haskell# programs, composition of skeletons may be used to describe
the topologies of the solutions. Firstly, consider implementations of row and
block clustering using the Workpool skeleton, where a manager process
distributes rows or blocks, respectively, as jobs to a collection of worker
processes, on demand, according to their availability. Once a worker finishes
a job, it sends
{- In FILE: MatMultTorus.hcl -}
configuration MatMult<N> where

    iterator i range [1,N*N]

    use Skeletons.{Torus, Farm}
    use ReadMatrix, MatrixMult, WriteMatrix
    import MatrixMult_WFs(splitM, combineM)

    unit farm_a; assign Farm<N*N, splitM, combineM> to farm_a
    unit farm_b; assign Farm<N*N, splitM, combineM> to farm_b
    unit torus; assign Torus<N> to torus

    [/ unify farm_a.worker[i] # a -> c,
             farm_b.worker[i] # b -> c,
             torus.cell[i/N][i%N] # (as_l, bs_t) -> (as_r, bs_d)
          to cell[i/N][i%N] # (a::Matrix, b::Matrix, as_l::[Matrix], bs_t::[Matrix])
                              -> (c::Matrix, as_r::[Matrix], bs_d::[Matrix]) /]

    unify farm_a.collector # c -> (), farm_b.collector # c -> () to collector # c::Matrix -> ()

    assign ReadMatrix N -> a to farm_a.distributor # () -> a
    assign ReadMatrix N -> b to farm_b.distributor # () -> b
    [/ assign MatrixMult to cell[i/N][i%N] /]
    assign WriteMatrix to collector

Figure 12: Systolic Matrix Multiplication using a Torus (HCL Code)



its result back to the manager and stays available for receiving another job.
This technique is suitable when the number of jobs exceeds the number of
processors available. Load balancing is automatically achieved in
architectures where processor workload or performance may vary. Because of
that, it has been widely used in grid computations [34]. The unit manager in
the Workpool skeleton of Haskell# has two groups of ports of kind any:
one for sending jobs to workers and another for receiving results from them.
Workers receive jobs from their input ports and send results through their
output ports.

Row and block clustering may also be implemented using the Farm skeleton.
Now, a master process sends a job to each slave process. Ideally, jobs have
similar workloads. After completing a job, each slave sends the result to the
master and finishes. The master combines the solutions received from all
slaves. This approach may significantly reduce the number of messages
exchanged and minimizes communication overhead by using the underlying
collective communication primitives. In fact, the Farm skeleton is defined by
overlapping the Gather and Scatter skeletons. Farm employs wire functions for
distributing and combining values sent to and received from slave processes.
To achieve good load balancing, processors must be homogeneous. This
is a reasonable assumption in cluster architectures, but not in
grid ones.
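The role of the wire functions in a Farm can be pictured as a split, map, combine pipeline. The sketch below is sequential and purely illustrative: `farm` and `splitInto` are hypothetical names, and the real skeleton distributes the jobs over processes through collective communication rather than through `map`.

```haskell
-- Farm semantics as a pure function: the split wire function produces one
-- job per slave, the worker is applied to each job, and the combine wire
-- function merges the partial results at the master.
farm :: (a -> [job]) -> ([res] -> b) -> (job -> res) -> a -> b
farm splitF combineF worker = combineF . map worker . splitF

-- Example split function: cut a list into n consecutive parts whose
-- lengths differ by at most one (assumes n >= 0).
splitInto :: Int -> [a] -> [[a]]
splitInto n xs = go n (length xs) xs
  where
    go 0 _ _ = []
    go k len ys =
      let size   = (len + k - 1) `div` k   -- ceiling of len / k
          (h, t) = splitAt size ys
      in h : go (k - 1) (len - size) t
```

For instance, `farm (splitInto 3) sum sum [1 .. 10]` sums three sublists independently and then adds the partial sums, giving 55. Equal-sized jobs such as those produced by `splitInto` balance the load only when the slaves run at similar speeds, which is the homogeneity assumption stated above.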
module MatMult_Toroidal(main) where
import MatrixTypes
import List

main :: Int -> Int -> Matrix -> Matrix
     -> [Matrix] -> [Matrix] -> (Matrix, [Matrix], [Matrix])
main = mult'

mult' nc nr sm1 sm2 sm1s sm2s = (result, toRight, toDown)
  where toRight = take (nc-1) (sm1:sm1s)
        toDown  = take (nr-1) (sm2':sm2s)
        sm2'    = transpose sm2
        sms     = zipWith multMatricesTr
                          (sm1:sm1s) (sm2':sm2s)
        result  = foldl1' addMatrices sms

addMatrices :: Matrix -> Matrix -> Matrix
addMatrices m1 m2 = zipWith addVectors m1 m2
  where addVectors :: Vector -> Vector -> Vector
        addVectors v1 v2 = zipWith (+) v1 v2

multMatricesTr :: Matrix -> Matrix -> Matrix
multMatricesTr m1 m2 =
  [[prodEscalar row col | col <- m2] | row <- m1]

foldl1' :: (a -> a -> a) -> [a] -> a
foldl1' f (x:xs) = foldl' f x xs

foldl' :: (a -> b -> a) -> a -> [b] -> a
foldl' f a []     = a
foldl' f a (x:xs) = foldl' f (f a x) xs

prodEscalar :: Vector -> Vector -> MyInteger
prodEscalar v1 v2 = sum (zipWith (*) v1 v2)


module LS_homSol(main) where
import Matrix
import LUDecompMatrix (det, homsolv)
import qualified Matrix (determinant)
import ModArithm

main :: (SqMatrix Integer, Vector Integer)
     -> [Integer] -> [[Integer]]
main (a,b) = gen_xList
  where
    gen_xList :: [Integer] -> [[Integer]]
    gen_xList ps = map get_homSol ps

    get_homSol :: Integer -> [Integer]
    get_homSol p =
      let
        b0 = vecHom p b
        a0 = matHom p a
        modDet = modHom p (determinant p a0)
        pmx = homsolv0 p a0 b0
        ((iLo,jLo),(iHi,jHi)) = matBounds a
      in
        (p : modDet : if modDet == 0
                        then [0]
                        else pmx)

slow_determinant :: SqMatrix Integer -> Integer
slow_determinant = Matrix.determinant

determinant :: Integer -> SqMatrix Integer -> Integer
determinant = det

homsolv0 :: Integer -> SqMatrix Integer
         -> Vector Integer -> [Integer]
homsolv0 p a0 b0 = vecCont v
  where
    v = homsolv p a0 b0

Figure 13: Functional Modules of Matrix Multiplication and LinSolv

Figure 11 presents the Haskell# configuration code for block clustering
using the Workpool and Farm skeletons. Two matrices are distributed, thus
it is necessary to overlap two instances of each skeleton, as illustrated in
Figure 10. The units readA, readB and writeC are clustered to implement the
manager process. The implementation of row clustering uses an identical
topological description. The differences are in the port types and in the
implementation of the computations. This shows the importance of reuse and
composition in Haskell# programming.

Gentleman's algorithm is implemented by overlapping two instances
of the Farm skeleton, one for each input matrix, with a Torus skeleton,
as in Figure 10. The Torus describes the interaction pattern among the slave
processes of the overlapped Farms. The HCL code for this arrangement
is presented in Figure 12.

Haskell# components that implement the solutions above have the same
names and interfaces. Only internal details, concerning the parallelism
strategy adopted, vary. Thus, they can be used interchangeably in an
application by nesting composition. The Haskell# visual programming
environment allows several component versions to co-exist. The programmer may
choose the appropriate version, depending on the target parallel architecture.
For instance, implementing matrix multiplication using Farm may be more
efficient in clusters. In grids, a Workpool may prove more suitable. In
supercomputers where processors are organized in a torus, the toroidal
solution may be the best choice.

Figure 14: Haskell# Topology For LinSolv

3.2 LinSolv 

Given a matrix $A \in \mathbb{Z}^{n \times n}$ and a vector $b \in \mathbb{Z}^{n}$, $n \in \mathbb{N}$, find an exact solution
to the linear system of equations of the form $Ax = b$.

The solution described here is exact and operates over arbitrary-precision
integers. A multiple homomorphic image approach is adopted [54], consisting
of three stages [SS]:

1. map the input data into several homomorphic images. The domain of
homomorphic images is $\mathbb{Z}$ modulo p ($\mathbb{Z}_p$), where p is a prime number;
{- In FILE: ManagerLS.WP.hcl -}

configuration LS_Manager<N> # (ab, xList) -> primes where
    use LS_Input, LS_Primes, LS_CRA, LS_Output

    unit input # () -> (ab, pBound, n); assign LS_Input to input
    unit primes # (unlucky_primes*, pBound) -> (guessedNoOfPrimes, primes*); assign LS_Primes to primes
    unit cra # (n, guessedNoOfPrimes, xList*) -> (c, unlucky_primes*); assign LS_CRA to cra
    unit output # c -> (); assign LS_Output to output

    connect primes -> guessedNoOfPrimes to cra <- guessedNoOfPrimes
    connect input_ab -> n to cra <- n
    connect input_ab -> pBound to primes <- pBound
    connect cra -> unlucky_primes to primes <- unlucky_primes
    connect cra -> c to output <- c

{- In FILE: LinSolv.WP.hcl -}
configuration LinSolv<N> where

    iterator i range [0,N-1]

    use Skeletons.{Collective.BCast, Workpool}
    use LS_Manager, LS_HomSol

    unit bcast_ab; assign BCast<N> to bcast_ab
    unit wp; assign Workpool<N> to wp

    interface ILinSolv (ab, job) -> (ab, job) where: ab@IBCast # job@IWorkpool
        behavior: seq {do ab; do job}

    unify wp.manager, bcast_ab.root to ls_manager # ILinSolv
    assign LS_Manager to ls_manager

    [/ unify wp.worker[i], bcast_ab.peer[i] to ls_worker[i] # ILinSolv
       assign HomSol to ls_worker[i] /]

Figure 15: LinSolv using a Workpool skeleton (HCL Code)

2. compute the solution in each of these images, using LU-decomposition 
followed by forward and backward substitution; 

3. combine the results of all images into a result in the original domain, 
using a fold-based CRA (Chinese Remainder Algorithm) [58] . 
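Stages 1 and 3 can be illustrated with a small, self-contained Haskell sketch. It is a textbook formulation of the Chinese Remainder Algorithm written for this explanation, not the code of the LS_CRA module; the names `toImage`, `egcd`, `craStep` and `cra` are assumptions, and stage 2 (solving the system in each image) is omitted.

```haskell
-- Stage 1: map integer data into its homomorphic image modulo a prime p.
toImage :: Integer -> [Integer] -> [Integer]
toImage p = map (`mod` p)

-- Extended Euclid: egcd a b = (g, x, y) with a*x + b*y == g.
egcd :: Integer -> Integer -> (Integer, Integer, Integer)
egcd a 0 = (a, 1, 0)
egcd a b = let (g, x, y) = egcd b (a `mod` b)
           in (g, y, x - (a `div` b) * y)

-- Combine x = r1 (mod m1) and x = r2 (mod m2), assuming m1 and m2 coprime.
craStep :: (Integer, Integer) -> (Integer, Integer) -> (Integer, Integer)
craStep (r1, m1) (r2, m2) =
  let (_, inv, _) = egcd m1 m2              -- m1 * inv = 1 (mod m2)
      m = m1 * m2
      r = (r1 + m1 * (((r2 - r1) * inv) `mod` m2)) `mod` m
  in (r, m)

-- Stage 3: fold-based CRA over a list of (residue, prime) pairs.
cra :: [(Integer, Integer)] -> (Integer, Integer)
cra = foldl craStep (0, 1)
```

As a check, `cra [(2,3),(3,5),(2,7)]` gives `(23,105)`: 23 is the unique value below 105 that leaves remainders 2, 3 and 2 modulo 3, 5 and 7. Because the combination is a fold, residues can be merged one at a time as workers deliver them.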

The parallel strategy implemented in Haskell# is based on the Eden and
GpH versions [50]. A manager process distributes the computations of
homomorphic solutions as jobs to a collection of worker processes. The
Workpool skeleton was adopted to distribute prime numbers to the workers and
to collect the computed homomorphic solutions. The BCast collective
communication skeleton is used for distributing the working data (A and b) to
the workers. The Haskell# configuration code that implements this arrangement
is presented in Figure 15. A composed component, LS_Manager, is configured
for aggregating the computations of the functional modules LS_Input (obtains
the input data A and b), LS_Primes (computes the list of primes for calculating
homomorphic solutions), LS_CRA (aggregates the homomorphic solutions
using the Chinese Remainder Algorithm), and LS_Output (outputs the result x).
In the composed component LinSolv, the main component, a cluster is created
by assigning LS_Manager to the unit ls_manager, which is configured in such
a way that it plays the role of the root unit in the BCast skeleton and of the
manager in the Workpool skeleton. The functional module LS_HomSol implements
the computation of a homomorphic solution for a given prime number. It is
assigned to the units ls_worker[i], for 0 ≤ i ≤ N − 1, obtained by unification
of the worker units of Workpool and the peer units of BCast. Notice that these
skeletons are overlapped. The cluster ls_manager might be placed onto a
multiprocessor node, in such a way that the processes input, primes, cra and
output could execute concurrently. Figure 14 illustrates the topological
specification of LinSolv. Figure 13 shows examples of functional modules of
Matrix Multiplication and LinSolv.

Figure 16: Haskell# Topology Composition For Ray Tracer

3.3 Ray Tracer 

Given a collection of objects in three-dimensional space, calculate the
corresponding two-dimensional image. All rays in a window (one for each pixel
in the grid) are traced and their intersections with objects are computed. The
colour of an intersection point is computed based on the strength of the ray
{- In FILE: RT_Manager.hcl -}

configuration RT_Manager<N> # (rt_raytracer<-xy, world) -> (rt_parameters->xy[2], world) where
    use RT_Parameters, RT_Result -- functional modules

    unit rt_parameters # () -> (xy<2>, world) groups xy: broadcast; assign RT_Parameters to rt_parameters
    unit rt_output # (xy, res) -> (); assign RT_Result to rt_output
    unit rt_raytracer # (xy, world) -> res

    connect rt_parameters -> xy[1] to rt_output <- xy

{- In FILE: RayTracer.hcl -}
configuration RayTracer<N> where

    iterator i range [0,N-1]

    use BCast, Scatter, Gather from Skeletons.Collective
    use RT_RayTracer -- functional module

    unit bcast_world; assign BCast<N> to bcast_world
    unit scatter_xy; assign Scatter<N> to scatter_xy
    unit gather_res; assign Gather<N> to gather_res

    interface IRayTracer (xy.*, world.*, res.*) -> (xy.*, world.*, res.*)
        where: (xy@IBCast # world@IScatter # res@IGather)
        behavior: seq {do world; do xy; do res}

    [/ unify bcast_ab.peer[i], scatter_world.peer[i], gather_res.peer[i] to rt_worker[i] # IRayTracer
       assign RT_RayTracer to rt_worker[i]
    /]

    unify bcast_world.root, scatter_xy.root, gather_res.root to manager # IRayTracer
    assign RT_Manager to manager

    unify manager.rt_raytracer, rt_worker[0]

Figure 17: Ray Tracer (HCL Code)

and texture of the object reached [59].

A data-parallel solution is trivial, since rays can be traced independently
for each pixel. In the Haskell# implementation, a direct mapping of the image
lines to parallel processes, assuming one process per processor, is employed.
Each process receives the same number of lines to compute. This solution
yields load balancing in homogeneous clusters. The HCL code for the ray tracer
is presented in Figure 17 and its topology is described in Figure 16. It is
implemented by overlapping three skeletons: BCast, Gatherv and Scatter.
The root units of these skeletons are unified to form the manager unit,
responsible for distributing work to and collecting results from the worker
units, which are obtained by overlapping their peer units. The manager also
acts as a worker. Distribution and collection are specified by wire functions.
The BCast skeleton disseminates the world scene to the workers. Scatter and
Gatherv are used to distribute jobs and collect the results from the workers.
3.4 NAS Parallel Benchmarks



This section presents the Haskell# implementations of a subset of NPB
(NAS Parallel Benchmarks) [2], a package comprising eight programs, specified
at the NASA Ames Research Center, USA, intended to benchmark the
performance of parallel computing architectures for the execution of NAS
(Numerical Aerodynamic Simulation) programs. The NPB programs implemented in
Haskell# are:

• EP (Embarrassingly Parallel) generates pairs of Gaussian deviates
according to a specified scheme and tabulates the number of pairs in
successive square annuli. It was developed to estimate the upper achievable
limit for floating-point performance in a parallel architecture;

• IS (Integer Sorting) performs parallel sorting of N keys using the bucket
sort algorithm. Keys are generated using a sequential algorithm
described in [3] and must be uniformly distributed;

• CG (Conjugate Gradient) implements a solution to an unstructured
sparse linear system, based on the conjugate gradient method. The inverse
power method is used to find an estimate of the largest eigenvalue of
a symmetric positive definite sparse matrix with a random pattern of
nonzeros;

• LU (LU factorization) uses the symmetric successive over-relaxation (SSOR)
procedure to solve a block lower triangular-block upper triangular
system of equations resulting from an unfactored implicit finite-difference
discretization of the Navier-Stokes equations in three dimensions.
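The bucket sort idea used by IS can be sketched sequentially as follows. This is only an illustration of the general technique under simple assumptions (keys lie in the range [0, maxKey) and nb ≥ 1); the name `bucketSort` is hypothetical, and the sketch reproduces neither the NPB key generation nor the communication pattern of the benchmark.

```haskell
import Data.List (sort)

-- Bucket sort for keys in [0, maxKey): keys are split into nb buckets by
-- value range, each bucket is sorted independently (the part that can be
-- done in parallel), and the sorted buckets are concatenated in order.
bucketSort :: Int -> Int -> [Int] -> [Int]
bucketSort nb maxKey keys = concatMap sort buckets
  where
    width   = max 1 ((maxKey + nb - 1) `div` nb)   -- ceiling of maxKey / nb
    buckets = [[k | k <- keys, k `div` width == b] | b <- [0 .. nb - 1]]
```

For example, `bucketSort 3 10 [9,1,5,3,7,0,2]` returns `[0,1,2,3,5,7,9]`. Uniformly distributed keys, as required by the benchmark, keep the buckets at similar sizes.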

NPB programs exercise the expressiveness of HCL for describing SPMD
programs and for translating MPI programs into Haskell#. LU gave us an
important insight into how to facilitate the programming of applications where
processes have a large number of input and output ports. CG and IS help
to evaluate the performance of collective communication skeletons.

3.4.1 The Embarrassingly Parallel (EP) Kernel

Figure 18: EP Topology

The HCL code of EP is presented in Appendix C.1. It declares n units,
named ep_unit[i], for 1 ≤ i ≤ n. The interface class that describes the
behavior of these units, IEP, is formed by the composition of three instances
of the IAllReduce interface class, called sx, sy and q. The channels
are defined by overlapping three instances of the AllReduce skeleton. For
that, the clusters sx_comm, sy_comm, and q_comm are associated with the
AllReduce component and their virtual units are unified. The HCL compiler uses
the topological information provided by the AllReduce skeleton and generates
code that uses the MPI_Allreduce primitive of MPI.
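The communication pattern abstracted by AllReduce can be stated as a pure function over the list of values contributed by the units. The sketch below only captures the meaning of the operation; the name `allReduce` is an assumption, and it does not show how the skeleton or MPI implement it.

```haskell
-- AllReduce semantics: every participant contributes one value and every
-- participant receives the reduction of all contributions. The operator
-- is assumed to be associative, as reduction operators normally are.
allReduce :: (a -> a -> a) -> [a] -> [a]
allReduce _  [] = []
allReduce op xs = map (const total) xs
  where total = foldr1 op xs
```

For example, `allReduce (+) [1,2,3,4]` gives `[10,10,10,10]`. In EP, three such reductions (sx, sy and q) are performed over the same set of units, which is why three instances of the skeleton are overlapped.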

3.4.2 The Integer Sort (IS) Kernel 

The HCL code of IS is shown in Appendix C.2. It declares a network of n
units, named is_unit[i], for 1 ≤ i ≤ n. The interface class that describes
the behavior of IS units, called IIS, is a composition of the interfaces
IAllReduce, IAllToAllv and IRShift. A cyclic pattern of communication (repeat
combinator) now appears, due to the presence of stream ports in the
specification of IIS.

Figure 19: IS Topology

The IS network topology is defined by overlapping the skeletons AllReduce and
AllToAllv, for collective communication, and RShift, which performs a
right shift of data amongst processes. The cluster units bs_comm, kb_comm and
k_shift are assigned to them, respectively, and their virtual units are
unified. The interface components bs, kb and rshift of IIS indicate which
ports of IS units participate in each skeleton instance, respectively.



3.4.3 The Conjugate Gradient (CG) Kernel

The original topology of CG, specified in FORTRAN/MPI, requires that the
number of processes, organized in a rectangular mesh, be a power of two.
The version of CG in Haskell# is less restrictive. The programmer must
provide the parameters dim (the number of mesh rows) and col_factor (the
number of mesh columns is obtained by multiplying it by dim). Any number
of units may be configured using this approach, but different configurations
may result in different performance. The programmer should adjust the
parameter values to the features of the execution environment. CG units are
named cg_unit[i][j], for 1 ≤ i ≤ dim and 1 ≤ j ≤ dim * col_factor. The HCL code
of CG is presented in Appendix C.3.
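The effect of the two parameters on the shape of the mesh can be made concrete with a short Haskell sketch. The function `cgUnits` is a hypothetical helper introduced only for this illustration; it enumerates the index pairs of the units and is not part of the HCL code.

```haskell
-- Index pairs (i, j) of the CG units for the given parameters:
-- dim rows and dim * colFactor columns, both counted from 1.
cgUnits :: Int -> Int -> [(Int, Int)]
cgUnits dim colFactor =
  [(i, j) | i <- [1 .. dim], j <- [1 .. dim * colFactor]]
```

For instance, `cgUnits 2 3` describes a mesh of 2 rows and 6 columns, that is, 12 units, which is not a power of two. Taking dim = 1 gives a single row of col_factor units, which is how any number of units can be reached.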

The interface class that describes the behavior of CG units, ICG, is a
composition of the interface classes IAllReduce (rho, aux, rnorm, norm_temp_1
and norm_temp_2) and ITranspose (q and r). The CG topology is defined by
overlapping the AllReduce and Transpose skeletons. The former is used for
data exchange during parallel scalar products along mesh rows, and the latter
for data exchange in parallel matrix multiplications, whenever a transpose
operation is performed on data stored in the processors. In the original MPI
code, several calls to the MPI_Irecv primitive are needed to perform these
operations, which makes it difficult to understand the structure of the
topology without a careful analysis of the parameters of the problem.

Five clusters are needed for each row of processes: rho_comm[i],
aux_comm[i], rnorm_comm[i], norm_temp_1[i], and norm_temp_2[i], for
$1 \le i \le rows$. The AllReduce component is assigned to them. The
Transpose component is assigned to the other two clusters, q_comm and
r_comm, encompassing all processes in the network. Their units are unified,
producing the final Haskell# topology of CG.

The Haskell# configuration code of Transpose is presented in Appendix
C.3.1. It organizes virtual units according to the parameters dim and
col_factor, supplied by the CG configuration. Firstly, a square mesh of units
with dimension dim is assembled. The ports are connected to transpose data
among processors using appropriate wire functions applied on groups of
ports. These units are then factorized into col_factor units each, resulting
in a mesh with dim rows and dim * col_factor columns. The diagram in
Figure 20 illustrates the factorization process involved in the Transpose
specification. In order to make it easier to understand, only the channels
connected to the ports of the unit at position [1][1] are shown. They are
replicated according to the factorization rules.

3.4.4 The LU Factorization (LU Simulated Application) 

The HCL code of LU is presented in Appendix C.4. LU organizes n processes,
where n is a power of two, in a grid. It employs the wavefront method
[H] in parallel computation. It differs from other NPB programs because
communication is performed by small messages of approximately 40 bytes.
Another particularity of LU is the large number of communication ports in
units (thirty input ports and thirty output ports). The skeletons
Exchange_1b, Exchange_3b, Exchange_4, Exchange_5, and Exchange_6 describe
the communication topologies of the several communication phases that occur
during execution, using the wavefront method. The same nomenclature employed
in the original LU versions is used here to make it easier to compare the
two approaches. In these skeletons, there are several interfaces for the
virtual units that comprise them. Their specifications vary according to
their position in the grid. Interface generalization is useful in such
cases, since it avoids treating classes of units individually in the
configuration.
Figure 21: Simple Components in Haskell



4 Implementation 

Haskell# may be implemented on top of a message passing library and a
sequential Haskell compiler, without any modifications or extensions to
either of them. MPI 1.1 and GHC (Glasgow Haskell Compiler) are currently
used, respectively. MPI is now considered the most efficient message passing
library for clusters, providing standard bindings for C and Fortran.
Recently, MPI versions for grid computing have appeared [19]. GHC is now
considered the state of the art in the compilation of lazy functional
programs. It supports FFI (Foreign Function Interface) [21], which makes it
possible to call MPI routines directly from Haskell programs. The use of an
efficient sequential Haskell compiler has an important impact on the
performance of Haskell# programs, since they assume medium and coarse
grained parallelism, where most of the time is spent in sequential mode of
execution. Haskell# implementations are easily portable to new MPI and GHC
versions. Indeed, it is possible to replace GHC with any Haskell compiler
that supports FFI. All optimizations and extensions provided by the Haskell
compiler may be enabled. This is an important feature of Haskell#, since
other parallel functional languages built on top of GHC need to modify its
run-time system. The current Haskell# implementation has already been tested
on top of LAM-MPI 6.5.9 [17], MPICH 1.2.5.2 [33] and GHC versions 6.0.1 and
6.2 in clusters equipped with RedHat Linux 8.0 and 9.0.
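
To make the role of FFI concrete, the sketch below binds a C function from
Haskell. It is only an illustration of the mechanism: the C standard library
function abs is used as a stand-in so that the example runs without an MPI
installation, and the wrapper name absViaC is hypothetical. MPI routines
would be imported with declarations of the same form.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}
import Foreign.C.Types (CInt)

-- Bind the C standard library function `int abs(int)`.
-- This is a stand-in for an MPI routine, chosen so that no MPI
-- library is needed to compile and run the example.
foreign import ccall unsafe "stdlib.h abs" c_abs :: CInt -> CInt

-- Hypothetical Haskell-level wrapper that converts between Int and CInt.
absViaC :: Int -> Int
absViaC = fromIntegral . c_abs . fromIntegral
```
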
4.1 An Overview of the Haskell# Compilation Process



The Haskell# compiler has been entirely programmed in Haskell, using Alex
2.0 [30] and Happy 1.13 [62] for parsing. It is divided into two modules:
front-end and back-end. The compilation process is illustrated in the
accompanying figure. The front-end module parses all components of a
Haskell# program by traversing its tree of components, from the application
component down to the simple components. A flat representation of the
process network is generated. Relevant topological information, obtained
from the use of skeletons, that could guide the back-end in the generation
of optimized code is stored. The flat code is currently represented as an
algebraic data type in Haskell, but it is intended to implement it in XML
(Extensible Markup Language), allowing it to be used as an intermediate
language for interfacing with tools for the analysis of performance and
formal properties in the programming environment under development.

The back-end uses the flat code and the topological information for
generating a wrapper module for each process and for inferring the mapping
of processes onto processors of the target architecture. A wrapper module is
a Haskell program that controls the execution of a process. The wrapper
modules and the functional modules are compiled using GHC. The mapper is a
program that copies the executable files onto the target machines where they
will execute, based on the mapping of processes onto processors inferred by
the back-end.

4.2 Wrapper Modules 

In Figure 22, the structure of a wrapper module is illustrated. A wrapper
makes a call to the main function of the functional module associated with
the process. The values produced at return points ($r_i$, $1 \le i \le k$)
are copied concurrently to channel variables (chan_r_i), using the functions
send_stream and send_atom, depending on the nature of the associated output
port. The arguments provided to the main function ($a_j$, $1 \le j \le n$)
may be obtained from channel variables (FORCED chan_a_j) or directly
(ON_DEMAND action_j), on demand of the evaluation of return points. The
function perform_actions controls the completion of the communication
operations, according to a guide automaton that recognizes the behavior
specified in the process interface. Whenever an output port must be
activated, perform_actions evaluates perform_communication, which reads a
value from the

Note: channel variables have type Chan t, from Concurrent Haskell [74].
module Main(main) where

import System(getArgs)
import Concurrent(forkIO, Chan, newChan, newQSemN, waitQSemN, signalQSemN)
import HHashSupport
import qualified [Functional Module] (main)
[import declarations that appear in # code]

main :: IO ()
main = do
    argv <- getArgs
    argc <- (return . length) argv                     -- MPI initialization
    give_args [] argv (mpi_init BUFFER_SIZE argc)

    -- Initializing channel variables for arguments
    chan_a1 :: Chan (Comm [Channel Type]) <- newChan
    chan_a2 :: Chan (Comm [Channel Type]) <- newChan
    ...
    chan_an :: Chan (Comm [Channel Type]) <- newChan

    -- Initializing channel variables for return points
    chan_r1 :: Chan (Comm [Channel Type]) <- newChan
    chan_r2 :: Chan (Comm [Channel Type]) <- newChan
    ...
    chan_rk :: Chan (Comm [Channel Type]) <- newChan

    FOR EACH p, AN INDIVIDUAL PORT OR GROUP OF PORTS INVOLVED IN A COLLECTIVE OPERATION:
        p <- [mpi_register_port ... | mpi_register_peer ...]
        let comm_p = [SingleIPort | SingleOPort | GroupIPort | GroupOPort |
                      Bcast | Gather | Scatter | Scatterv | Allgather |
                      Allgatherv | Allreduce | Alltoall | Alltoallv |
                      Reduce_Scatter | Scan] p ...

    caut <- [code to setup guide automata]
    control_automata_init caut

    sync <- newQSemN
    forkIO (perform_actions >> signalQSemN sync)

    let a1 = [recv_stream | recv_atom] [ON_DEMAND action_1 | FORCED chan_a1]
        a2 = [recv_stream | recv_atom] [ON_DEMAND action_2 | FORCED chan_a2]
        ...
        an = [recv_stream | recv_atom] [ON_DEMAND action_n | FORCED chan_an]

        (r1, r2, ..., rk) = [Functional Module].main a1 ... an

    forkIO ([send_stream | send_atom] chan_r1 r1 >> signalQSemN sync)
    forkIO ([send_stream | send_atom] chan_r2 r2 >> signalQSemN sync)
    ...
    forkIO ([send_stream | send_atom] chan_rk rk >> signalQSemN sync)

    waitQSemN (k+1) sync



Figure 22: Wrapper Module 
Figure 23: Schematic Representation of a Wrapper Module



corresponding return point and sends it through the active port. For input
ports, perform_communication may be called inside the recv_stream or
recv_atom functions, when an argument value is demanded. In this case, the
operation is validated by the guide automaton and a channel variable is not
necessary. However, in some collective communication operations, when a
process sends and then receives a value (the root process in a broadcast,
for example), it is necessary to write and read, in a single call to
perform_communication, the channel variables associated with a return point
and with an argument, respectively. This is a situation where a channel
variable is necessary for an argument. The Haskell# compiler forces the
evaluation of input ports inside perform_actions whenever it can infer that
an input port must be strictly activated before the activation of some
output port. This is typical when the alt (choice) constructor does not
occur in the process behavior specification. Figure 23 illustrates the use
of channel variables.

Since processes spend some time in synchronization, the concurrent
evaluation of perform_actions and of return points, using send_stream and
send_atom, allows computations to be overlapped while a process is executing
perform_communication. In multiprocessors and superscalar processors, which
may execute instructions in parallel and speculate about their execution,
performance might be improved.
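
The synchronization pattern of the wrapper can be sketched with the current
Control.Concurrent API, whose argument order differs from the older library
used in Figure 22. This is a simplified illustration under stated
assumptions, not the generated wrapper code: sendAtom and runWrapper are
hypothetical names, the thread that reads the channels merely stands in for
perform_actions, and no MPI communication takes place.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.Chan (Chan, newChan, readChan, writeChan)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Concurrent.QSemN (newQSemN, signalQSemN, waitQSemN)

-- Hypothetical stand-in for send_atom: copy one return value
-- into the channel variable of its return point.
sendAtom :: Chan a -> a -> IO ()
sendAtom = writeChan

-- Fork one sender per return point plus one consumer thread, and
-- wait on a quantity semaphore until all k + 1 threads have finished.
runWrapper :: [a] -> IO [a]
runWrapper results = do
  let k = length results
  sync  <- newQSemN 0
  chans <- mapM (const newChan) results
  out   <- newEmptyMVar
  -- Stand-in for perform_actions: read every channel variable in order.
  _ <- forkIO (mapM readChan chans >>= putMVar out >> signalQSemN sync 1)
  -- One thread per return point, each signalling completion.
  mapM_ (\(c, r) -> forkIO (sendAtom c r >> signalQSemN sync 1))
        (zip chans results)
  waitQSemN sync (k + 1)
  takeMVar out
```

The single waitQSemN call with quantity k + 1 mirrors the last line of the
wrapper: the main thread resumes only after the consumer and all k senders
have signalled.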

4.3 Guide automaton: Controlling Activation of Ports 

A guide automaton is an abstract data type, implemented in C, used for
controlling and validating the activation order of ports in the execution of
Haskell# programs. It may be algebraically described by a tuple of the
following form:

$C = (\Pi, Q, T, \varphi_0, \varphi_1, \rho, F, S, \sigma, \pi, \gamma, \kappa)$

Figure 24: Guide Automaton

where:

• $\Pi$ is a set of port identifiers that forms the alphabet of the guide
automaton;

• $Q$ is a finite set of states;

• $T$ is a finite set of transitions;

• $\varphi_0 : T \to Q$ maps each transition to its origin state;

• $\varphi_1 : T \to Q$ maps each transition to its target state;

• $\rho : T \to \Pi$ labels each transition with a port identifier;

• $F \subseteq Q$ is a set of final states;

• $S$ is a finite set of symbols, representing semaphores;

• $\sigma : Q \to 2^{S \times \mathbb{Z}}$ associates states to semaphore
updates. For instance, consider a semaphore $s \in S$. If
$(s, n) \in \sigma(q)$ then the value of $s$ must be incremented by $n$
when entering state $q$;

• $\pi : Q \to \{forward, choice, fork, join\}$ gives the kinds of the
states;

• $\gamma : Q \to \{True, False\}$ maps choice states to an expression
(the termination condition of a repeat combinator) that evaluates to True
or False;

• $\kappa : Q \to 2^{Q} \times 2^{Q}$ associates a state $q$ to a pair of
sets of states $(Q', Q'')$, whose meaning depends on $\pi(q)$ (see the next
paragraph).
States and transitions are represented as natural numbers. The initial
state is 0 (zero). Let $q$ be the current state of a guide automaton. The
function perform_actions looks up $\kappa(q)$ in order to choose the next
communication operation to be performed. For instance, consider
$\kappa(q) = (Q', Q'')$. There must be a path from state $q$ to each state
in $Q' \cup Q''$. If $\pi(q) = forward$, $Q'' = \emptyset$ and $Q'$
determines the forward states of $q$. Among them, the goal states are
chosen. For that, let us consider the set of transitions
$T^{*} = \{ t \mid \varphi_1(t) = q' \wedge q' \in Q' \wedge t$ is in a path
from $q$ to $q' \}$. Port $p$ is chosen from the ports
$\{ p \mid t \in T^{*} \wedge \rho(t) = p \}$, among those whose
communication pairs are active at that instant (ready for communication).
Forward states $q'$ such that, for some $t \in T^{*}$, $\varphi_1(t) = q'$
and $\rho(t) = p$, are goal states. Choices appear only in the
implementation of occurrences of the alt constructor. The port $p$ is
activated. If $p$ is an output port (default case), it may cause the
implicit activation of input ports, in recv_stream or recv_atom function
calls, before completing communication. After any port activation in
perform_communication, the advance_automata function is called for updating
the current automaton state, validating the operation by raising an error
whenever there is no transition from the current state labelled with the
activated port, and updating semaphores. After the activation of $p$, the
guide automaton must be in one of the goal states. Otherwise, the operation
is invalid. If $\pi(q) = choice$, $\gamma(q)$ must be evaluated (termination
condition of a repetition). If $\gamma(q)$ is true, the set of forward
states of $q$ is $Q'$, otherwise it is $Q''$. Choice states are used in the
implementation of occurrences of the repeat and if combinators. If
$\pi(q) = fork$, $Q'' = \emptyset$ and
$\forall t : \varphi_0(t) = q : \varphi_1(t) \in Q' \wedge \rho(t) = \bot$.
When a fork state is reached, threads are forked for executing communication
actions starting from the states in $Q'$. All threads must reach the same
join state, where they finalize, and execution resumes from that state. If
$\pi(q) = join$,

= and = 0. Fork and join states are used to implement occurrences 
of the par combinator. If there is no forward state from the current state
and it is a final state, perform_actions finalizes.
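
The validation step described above can be sketched in Haskell as follows.
This is a deliberately reduced model under stated assumptions: the actual
guide automaton is implemented in C, and the sketch keeps only states,
labelled transitions and final states, leaving out state kinds, the function
that pairs states with sets of states, and semaphores. All type and function
names (Guide, advance, accepts) are hypothetical.

```haskell
import Control.Monad (foldM)

type State = Int
type Port  = String

-- A transition with its origin state, target state and port label.
data Transition = Transition
  { origin :: State
  , target :: State
  , label  :: Port
  }

-- Reduced automaton: transitions and final states only.
data Guide = Guide
  { transitions :: [Transition]
  , finals      :: [State]
  }

-- Validate one port activation: follow the transition that leaves `q`
-- labelled with `p`, or report an error when there is none.
advance :: Guide -> State -> Port -> Either String State
advance g q p =
  case [ target t | t <- transitions g, origin t == q, label t == p ] of
    (q' : _) -> Right q'
    []       -> Left ("no transition from state " ++ show q
                      ++ " labelled " ++ p)

-- Replay a sequence of activations from the initial state 0 and
-- check that it stops in a final state.
accepts :: Guide -> [Port] -> Bool
accepts g ports =
  case foldM (advance g) 0 ports of
    Right q -> q `elem` finals g
    Left _  -> False
```

As a usage example, a two-state automaton with transitions 0 to 1 labelled
"in" and 1 to 0 labelled "out", and final state 0, accepts the activation
order in, out, in, out, and rejects an activation order that starts with
out.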

Semaphores are updated in calls to advance_automata. The function $\sigma$
is used to update their values according to the new current state. A
semaphore may have more than one value at a time. During execution, it must
be guaranteed that every semaphore has at least one positive value.
Otherwise, an error is reported. Negative values are discarded. Semaphores
only exist for validating non-regular patterns of communication that may be
described by labelled Petri nets [89]. However, in general, regular patterns
of communication are sufficient to describe the behavior of most
high-performance
Table 1: Meaning of parameters of mpi_register_pair and mpi_register_peers

Parameter             pair  peer  Description
Direction              *          Specifies if a port is for input or output
Source/Target rank     *          Rank of the process that owns its comm. pair
Channel tag            *          A number that individually identifies a channel
Collective Op. Type          *    Kind of the collective communication operation
Number of Processes          *    Number of processes in the collective operation
Processes in group           *    Ranks of processes in the collective operation
Buffer Size            *     *    Buffer used for storing data to be transmitted
Data Type                    *    MPI data type (used in reduce operations)
Reduce Operation             *    MPI operation (used in reduce operations)
Is Probed Flag         *          Flag indicating if a port belongs to a choice group
Pair is Probed Flag    *          Flag indicating if the communication pair of a
                                  port belongs to a choice group



parallel programs [BSlEn]. Thus, overhead due to semaphore updating might 
be avoided for parallel programs where peak performance is critical. 

4.4 Implementing Communication Operations 

There are two kinds of communication operations in Haskell#: point-to-point
and collective. The former is implemented through the simultaneous
activation of a channel's communication pairs. MPI tags, in message
envelopes, represent communication channels in calls to point-to-point
primitives. The latter is implemented using the MPI support for dynamic
configuration of communication groups and contexts, together with the MPI
collective communication primitives. Groups of ports involved in a
collective communication are called communication peers. Each communication
pair is configured using the function mpi_register_pair, while communication
peers are configured in a single call to mpi_register_peers. These functions
are implemented in C and called from Haskell code through FFI. Their
arguments, detailed in Table 1, set up the parameters for the completion of
communication operations over the involved ports during execution. A
communication handle, an integer number, is returned and bound to a
variable, giving access to pair/peers information whenever necessary.

The polymorphic and higher-order function perform_communication has one
argument, a value of the algebraic data type PortInfo t u v, whose
constructors identify the kind of communication operation to be performed:
SingleIPort, SingleOPort, GroupIPort, GroupOPort (point-to-point
communication), Bcast, Gather, Scatter, Scatterv, Allgather, Allgatherv,
Allreduce, Alltoall, Alltoallv, Reduce_Scatter, Scan (collective
communication). The fields of PortInfo encapsulate the information necessary
for the completion of communication operations: communication handle, port
type (choice or combine), wire functions, and channel variables. The type
variables t, u and v are used for generalizing the types of channel
variables and wire functions.

The MPI point-to-point communication primitive used for the completion of
communication over an individual output port (SingleOPort) depends on the
communication mode of the channel to which it is linked: buffered
(MPI_Bsend), synchronous (MPI_Ssend) or ready (MPI_Rsend). For groups of
output ports of kind All, the corresponding asynchronous MPI sending
primitives (MPI_Ibsend, MPI_Issend and MPI_Irsend) are used for initiating
the communication on each port belonging to the group. Then, a call to
MPI_Waitall waits for the completion of all the returned requests.
Similarly, a call to MPI_Recv implements the communication on individual
input ports, while MPI_Irecv (asynchronous) and MPI_Waitall implement groups
of input ports of kind All. Groups of ports of kind Any are implemented
using the channel probing protocol, which allows the verification of the
status of activation of communication pairs.
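
The correspondence between channel modes and MPI sending primitives
described above can be summarized as a small lookup, shown here as a Haskell
sketch. The type Mode and the function names are hypothetical and serve only
to tabulate the mapping; the code returns primitive names as strings and
does not call MPI.

```haskell
-- Communication mode of a channel.
data Mode = Buffered | Synchronous | Ready
  deriving (Eq, Show)

-- Blocking primitive used for an individual output port (SingleOPort).
blockingSend :: Mode -> String
blockingSend Buffered    = "MPI_Bsend"
blockingSend Synchronous = "MPI_Ssend"
blockingSend Ready       = "MPI_Rsend"

-- Asynchronous primitive used for each port of a group of kind All.
asyncSend :: Mode -> String
asyncSend Buffered    = "MPI_Ibsend"
asyncSend Synchronous = "MPI_Issend"
asyncSend Ready       = "MPI_Irsend"

-- Sequence of MPI calls issued for a group of n output ports of kind
-- All: one asynchronous send per port, then a single MPI_Waitall.
groupSendCalls :: Mode -> Int -> [String]
groupSendCalls mode n = replicate n (asyncSend mode) ++ ["MPI_Waitall"]
```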

Transmitting streams and atom values. In Haskell#, a value of type t
is transmitted as a value of the algebraic type Comm t, whose Haskell
representation is depicted below:

data Comm t = Atom {data :: t} | Mm {data :: t} | End {depth :: Int}

The Atom constructor encapsulates atomically transmitted values, while
streamed ones are encapsulated using the Mm and End constructors. The
integer value in the End field represents the depth of a finalized stream.

For instance, consider a stream port p of type (Int,Int) and nesting
factor 3 (p***::(Int,Int)). The lazy list associated to the port must be of
type [[[(Int,Int)]]]. Consider the lazy list [[[(1,2),(3,4)], [(5,6)]],
[[(7,8),(9,0),(1,2)]], [[], [(3,4)], [(5,6),(7,8)]]]. The list of values
effectively transmitted through the stream port p at each activation is
[Mm (1,2), Mm (3,4), End 3, Mm (5,6), End 3, End 2, Mm (7,8), Mm (9,0),
Mm (1,2), End 3, End 2, End 3, Mm (3,4), End 3, Mm (5,6), Mm (7,8), End 3,
End 2, End 1]. Whenever possible, stream communication is implemented using
MPI persistent communication objects, for minimizing communication overhead.
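
The flattening of a nested lazy list into the transmitted sequence can be
reproduced with a short Haskell sketch. It is an illustration of the
encoding rule only, not the Haskell# implementation. The constructor name
Elem stands in for the stream element constructor of Comm t, the fields are
written positionally, and the helper type Nested with the functions encode
and fromList3 are hypothetical.

```haskell
-- Transmitted values: atoms, stream elements and end-of-stream marks
-- carrying the depth of the stream that has just finished.
data Comm t = Atom t | Elem t | End Int
  deriving (Eq, Show)

-- Helper representation of a nested stream of arbitrary depth.
data Nested t = Leaf t | Level [Nested t]

-- Encode a stream whose outermost list is at depth d: elements are
-- sent in order and every finished list is closed by End <its depth>.
encode :: Int -> Nested t -> [Comm t]
encode _ (Leaf x)   = [Elem x]
encode d (Level xs) = concatMap (encode (d + 1)) xs ++ [End d]

-- Build the helper representation from a list nested three levels deep.
fromList3 :: [[[t]]] -> Nested t
fromList3 = Level . map (Level . map (Level . map Leaf))

-- The example list from the text, encoded from depth 1.
example :: [Comm (Int, Int)]
example = encode 1 (fromList3
  [ [[(1,2),(3,4)], [(5,6)]]
  , [[(7,8),(9,0),(1,2)]]
  , [[], [(3,4)], [(5,6),(7,8)]]
  ])
```

Evaluating example yields the same order of elements and End marks as the
transmitted sequence listed above, ending with End 1 for the outermost list.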

Marshalling Haskell Values to C Buffers. In order to transmit Haskell
values using MPI primitives, they must be marshalled into contiguous C
buffers. For that, the Storable class, from FFI, is employed. Default
Storable instances are provided for basic data types. User defined data
types must be made instances of this class. The Haskell# compiler traverses
the Haskell modules of the Haskell# program to find the user defined types
that must be instantiated for the Storable class. Structured data types,
such as lists, arrays, tuples and algebraic data types, must be packed and
unpacked element by element. This can be a considerable source of
inefficiency when the number of elements is very large. The benchmarks
presented in Section 5 give evidence of this fact. GHC provides unboxed
arrays, whose values are stored in contiguous memory areas and can be
directly marshalled to MPI buffers. Since most high performance computing
applications operate over arrays rather than lists, unboxed arrays may be
used in order to avoid this source of inefficiency.
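
A minimal sketch of element-by-element marshalling with the Storable class
is given below. The type Pair and the function roundTrip are hypothetical
and chosen for illustration; the instance that the Haskell# compiler would
require for a given user type may differ. The example packs a list into a
contiguous C buffer and reads it back, without involving MPI.

```haskell
import Data.Int (Int32)
import Foreign.Marshal.Array (peekArray, withArrayLen)
import Foreign.Storable (Storable (..))

-- Hypothetical user defined type: a pair of 32-bit integers.
data Pair = Pair Int32 Int32
  deriving (Eq, Show)

-- Fixed layout: two consecutive 4-byte fields, 8 bytes in total.
instance Storable Pair where
  sizeOf _    = 8
  alignment _ = 4
  peek p = do
    a <- peekByteOff p 0
    b <- peekByteOff p 4
    return (Pair a b)
  poke p (Pair a b) = pokeByteOff p 0 a >> pokeByteOff p 4 b

-- Pack a list, element by element, into a temporary contiguous
-- C buffer and unpack it again.
roundTrip :: [Pair] -> IO [Pair]
roundTrip xs = withArrayLen xs (\n buf -> peekArray n buf)
```

Each element is written with one poke and read with one peek, which is the
per-element cost that the paragraph above identifies as a source of
overhead for large lists.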

5 Performance Evaluation 

This section presents some performance figures for the Haskell# programs
presented in Section 3. The architecture used is a Beowulf cluster
comprising 16 dual-processor Intel Xeon nodes (clock: 2 GHz, RAM: 1GB),
connected through Fast Ethernet (100 Mb/s). Measurements with 32 processors
were performed in dual multiprocessing mode. MPICH 1.2.3 on top of TCP/IP
was used for communication between processes.

5.1 Benchmarking Haskell# with NPB

The benchmark results of the Haskell# versions of NPB kernels (EP, IS and
CG) are presented in Figure 25. The plots on the left hand side present
their respective running times, while the plots on the right hand side
present their corresponding absolute speedups, comparing them to linear
speedup, always represented by a solid line.

Two problem instances were used for measuring the performance of the
Haskell# kernel versions (Table 2). In the second one, processes demand
about twice
Figure 25: Performance Figures of NPB kernels in Haskell#

Table 2: Instances of Problem Sizes Used to Run Each Kernel

Kernel   1st Problem Size        2nd Problem Size
EP       m = 25                  m = 28
IS       total_keys_log2 = 20    total_keys_log2 = 21
         max_key_log2 = 16       max_key_log2 = 17
CG       na = 14000              na = 18000
         nonzer = 11             nonzer = 12
         niter = 45              niter = 45


as much memory space as in the first one, without exhausting the physical
memory resources of a single node of the cluster. The default problem
classes of NPB (S, W, A, B, C) were not used because they were tuned for use
with the original C/FORTRAN + MPI versions. Due to laziness and the use of
immutable arrays, the sequential performance of the Haskell# versions is
about an order of magnitude worse than that of the original versions of NPB
kernels, considering both time and space. Because of that, some default
problem sizes exhaust the physical memory resources of cluster nodes,
causing virtual memory overheads that must be avoided in measurements. The
use of mutable arrays could minimize this source of inefficiency, but they
require the encapsulation of computations inside the IO monad, preventing
arrays from being transmitted through lazy lists.

Also due to performance differences in sequential mode of execution, the
granularity of Haskell# processes is coarser than the granularity of
processes in the original NPB versions. While Haskell computations execute
slower than C/FORTRAN computations, the amount of data transmitted is about
the same. The original speedup measures of NPB kernels serve only to
establish the lower bounds of the performance of the cluster. One should not
use them to make assumptions and claims about the relative efficiency of the
Haskell# implementation.

Using GHC profiling tools [81], five main cost centres were identified in
the CG and IS Haskell# implementations. Table 3 presents the impact of each
of them in parallel execution. The impact of cost centres on speedup is
evaluated in Table 4. By analyzing the data obtained, one may conclude that:

1. If only time spent in computation is considered, the speedup is linear; 
Table 3: Cost Centre Analysis of IS and CG (% of total execution time)

i: Raw computation time, ii: Evaluation of wire functions, iii: Marshalling,
iv: Communication and synchronization, v: Garbage Collection



2. The marshalling cost centre is the only source of overhead inherent to
the Haskell# implementation. The other ones are inherent to parallelism.
In some cases, marshalling overhead increases with the number of
processors (CG-1 and CG-2). Marshalling could be avoided if GHC allowed
immutable arrays to be copied to contiguous buffers in constant time, but
this feature is not yet available;

3. The garbage collection overhead decreases by increasing the number 
of processors used in parallel computation. This fact is attributed to 
less use of heap when the problem size is split among more processors 
and the enforcement of data locality. Cache behavior effects are also 
being investigated. It is worthwhile to remember that garbage collector 
parameters were tuned before execution. The results obtained here do 
not guarantee that every Haskell program presents the same behavior; 

4. In CG, whenever the number of processors increases, the gains in
performance due to the reduction of the garbage collection overhead
appear to compensate for the losses due to the marshalling overhead. Thus,
in some cases, the Haskell# overhead may be considered null. Indeed,
assuming that arrays are copied directly and in constant time, the
reduction of the garbage collection overhead could compensate for the
sources of overhead that are inherent to parallelization.






Table 4: Influence of Cost Centres in Speedup 
a: i, b: i/ii, c: i/ii/iii, d: i/ii/iii/iv, e: i/ii/iii/iv/v



The observations above are evidence that Haskell# is an efficient approach
for parallelizing functional computations. The observed fact that splitting
problems among processors may reduce garbage collection overheads is
another motivation for using Haskell# for parallelizing scientific
high-performance applications written in Haskell, in addition to the gains
in execution time of computations, since this kind of application normally
processes large data structures stored in memory. The benchmarks presented
in the next section compare Haskell# to other parallel functional
languages.

5.2 Benchmarking Haskell# with Loidl's Benchmark Suite

The benchmarking results of the Haskell# implementations of Matrix
Multiplication (MM), LinSolv (LS), and Ray Tracer (RT), based on the Eden
and GpH versions presented in [59], are shown in Figure 26. The parameters
are described in Table 5. Since the cluster used has nodes about three times
as fast as the nodes of the cluster used in Loidl's measures, the problem
instances of MM and RT used in this paper are larger. This attempts to
approximate the sequential run-time of the original measures and to increase
the granularity of computations. For LS, however, the same problem size is
used, since its scalability is less sensitive to variations in problem size.

The speedup curves of LS and RT are nearly linear, while the speedup 
Figure 26: Performance Figures of MM, LS and RT in Haskell#

Table 5: Problem Instance Parameters for Loidl's Benchmark Suite

Matrix Multiplication   960 x 960 matrices of integers with maximum
                        value of 65536.
LinSolv                 Dense 62 x 62 matrix of arbitrary precision
                        integers with maximum value of 2^^ - 1.
Ray Tracer              A 1000 x 1000 image (in pixels) with a scene
                        comprising 640 spheres.



curve of MM is negatively affected by the overhead of marshalling large
nested lists of integers. For row and block clustering of MM, the times
measured on 16 processors were slightly worse than those obtained for 8 and
9 processors, respectively. The marshalling overhead could be reduced by
using Haskell arrays instead of lists to represent matrices. The Haskell#
implementations of NPB kernels, where the amount of exchanged data is far
larger, support this hypothesis. The toroidal version of MM scales better
than row and block clustering, since the amount of data transmitted is
comparatively smaller.

It is important to observe that the measures of LS and RT for 32 processors
were obtained in dual processing mode across the 16 nodes of the cluster.
Unexpected additional overhead was observed when executing MPI programs
using the dual mode processing capabilities. This effect was more easily
observed when measuring the run time of the Haskell# versions of the NPB
kernels CG and IS, probably due to the large amount of data exchanged
between processors in collective communication. Because of that, the results
for NPB with 32 processors were not presented. Better speedups for LS and RT
would be expected with 32 non-dual processors. For that reason, in the
following discussion, the measures with 16 processors are used as a
reference for comparing the benchmarks of Haskell# to the benchmarks of GpH,
Eden, and PMLS.

Comparing the Haskell# results to the best ones obtained for Eden, GpH,
and PMLS described in [59], one may observe that the Haskell# results are
at least slightly better in all cases. For example, MM using the toroidal
solution obtains a speedup of 11.0 on 16 processors, while a speedup of
approximately 5.0 was the best obtained with the Eden toroidal solution. For
LS, the speedup obtained in Haskell# is 14.3 on 16 processors, while the
best speedup obtained in the Eden version was 14.0. For RT, a speedup of
15.6 was obtained by Haskell# on 16 processors, while 15.1 was the best
speedup obtained in the PMLS version.
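
To relate these speedups to linear speedup, the parallel efficiency
(speedup divided by the number of processors) can be computed from the
figures quoted above. The sketch below is a simple calculation added for
illustration; efficiency is a standard derived metric that the measurements
themselves do not report, and the function and list names are hypothetical.

```haskell
-- Parallel efficiency: speedup divided by the number of processors.
efficiency :: Double -> Int -> Double
efficiency speedup procs = speedup / fromIntegral procs

-- Haskell# speedups on 16 processors, as quoted in the text.
reported :: [(String, Double)]
reported = [("MM (toroidal)", 11.0), ("LS", 14.3), ("RT", 15.6)]

-- Efficiency of each benchmark on 16 processors.
efficiencies :: [(String, Double)]
efficiencies = [ (name, efficiency s 16) | (name, s) <- reported ]
```

With these inputs the efficiencies are about 0.69 for MM, 0.89 for LS and
0.975 for RT, which matches the observation that the LS and RT curves are
nearly linear while MM is not.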

The results presented herein are not yet sufficient to conclude that
Haskell# programs are always more efficient than their GpH, Eden and PMLS
versions. The two compared benchmarks were obtained on distinct
architectures and using different problem sizes. However, the results
presented in this paper are evidence that the Haskell# implementation
behaves comparably to well-known and mature implementations of parallel
functional languages, such as GpH, Eden and PMLS. The results obtained are
not surprising, since the Haskell# run-time system is very light in relation
to the complex parallel run-time systems of GpH, Eden, and PMLS, which try
to hide some parallel management details from programmers to different
degrees. Decades of experience in parallel language design have shown that
the more explicit a general parallel language is, assuming that it has an
efficient implementation, the better the scalability it obtains with a
simpler run-time system. The combination of the results obtained in this
paper and in [52] supports this hypothesis. In this sense, Haskell# is the
most explicit of all, followed by Eden, PMLS and GpH, respectively.

6 Conclusions and Lines for Further Work 

This paper introduces Haskell#, a coordination language for describing the
parallel execution of functional computations in Haskell. Haskell# intends
to raise the level of abstraction in explicit message-passing parallel
programming on distributed architectures, such as clusters, for the
development of large scale parallel scientific computing applications.
Motivating examples, implementation issues and performance figures of
Haskell# benchmarks are also presented.

After some years of design, implementation and evaluation, Haskell# has
reached some maturity. Several follow-up works are in progress. Firstly, a
parallel programming environment based on Haskell# has been prototyped
in JAVA, including support for programming with visual abstractions,
integration with Petri net tools for animation, proving of formal
properties, and performance evaluation of programs. The use of network
simulators, such as Network Simulator (NS) [Mj, for simulating the
performance of parallel programs is also under development. Such a tool will
make it possible to study the effect of modifications to network
characteristics on the performance of parallel programs. This work has an
important impact on studying the behavior of Haskell# programs when
executing on grids.

Since HCL is a coordination language orthogonal to Haskell, it is
conceptually possible to use other languages, as alternatives to Haskell, for
programming functional modules. The parallel programming environment under
development assumes that Haskell is the ideal language for specifying,
prototyping and evaluating the formal properties of parallel programs. Once
the parallel composition is proved to be safe, programmers may implement the
functional modules using a language more appropriate for the functionality
of the simple components. For example, numerically intensive functional
modules could be implemented in Fortran, while sorting large amounts of
numbers in parallel may be implemented in C. Java can be used for
programming functional modules that access a database. This kind of
multi-lingual compositional approach is left as a further development. One
important design difficulty with the multi-lingual approach is maintaining
the orthogonality between the languages used at the coordination and
computation levels in the absence of lazy and higher-order functional
programming. Imperative languages, for example, do not allow the control flow
to be hidden. It is intended to use techniques from aspect-oriented
programming (AOP) to address this matter. In this direction, parallel
composition could be treated as an aspect of programming. The recent
appearance of heterogeneous versions of MPI [84] is important for making a
multi-lingual approach feasible for Haskell#.
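
As a rough illustration of the multi-lingual idea, the Haskell 98 Foreign Function Interface [24] already lets a functional module delegate a computation to C code. The sketch below is only an assumption about how such a module might look: the names `c_cos` and `kernel` are hypothetical, and the C library function `cos` merely stands in for a real numerical kernel.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}

import Foreign.C.Types (CDouble (..))

-- Bind the C math library function cos. In a real multi-lingual
-- functional module this would be a numerical kernel written in C
-- (or in Fortran behind a C-compatible interface).
foreign import ccall unsafe "math.h cos"
  c_cos :: CDouble -> CDouble

-- Hypothetical functional module body: a pure Haskell wrapper that
-- converts between Haskell and C floating-point types.
kernel :: Double -> Double
kernel = realToFrac . c_cos . realToFrac
```

Because the wrapper has an ordinary pure type, a coordination layer would not need to know that the computation is carried out by foreign code.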

An even more important topic to be addressed is the development of
cost models for Haskell# skeletons, incorporating the possibility of
overlapping them, and their use for allowing the Haskell# compiler to make
automatic decisions, such as better allocation of processes to processors,
use of special primitives, and special restrictions on communication modes,
such as the size of buffers. A more recent idea is to design a meta-language
with which programmers explicitly instruct the Haskell# compiler on how to
generate the appropriate code for a given skeleton or combination of
skeletons. The latter approach is more in tune with the Haskell# design
premises. However, it is not difficult to see that the two lines could be
combined.

Further developments will address grid-enabled implementations of Haskell#.
A grid-enabled version of MPI, such as the recently proposed MPICH-G2 [49],
might be used.






References 



[1] F. Arbab, P. Ciancarini, and C. Hankin. Coordination Languages for 
Parallel Programming. Parallel Computing, 24(7):989-1004, 1998. 

[2] D. H. Bailey et al. The NAS Parallel Benchmarks. International
Journal of Supercomputing Applications, 5(3):63-73, 1991.

[3] D. H. Bailey, T. Harris, W. Shapir, R. van der Wijngaart, A. Woo,
and M. Yarrow. The NAS Parallel Benchmarks 2.0. Technical
Report NAS-95-020, NASA Ames Research Center, December 1995.
http://www.nas.nasa.org/NAS/NPB

[4] M. Baker, R. Buyya, and D. Hyde. Cluster Computing: A High Perfor- 
mance Contender. IEEE Computer, 42(7):79-83, July 1999. 

[5] S. Balay, K. Buschelman, W. Gropp, D. Kaushik, M. Knepley, L. C.
McInnes, B. Smith, and H. Zhang. PETSc Users Manual. Technical Report
ANL-95/11 Revision 2.1.3, Argonne National Laboratory, Argonne,
Illinois, 1996. http://www.mcs.anl.gov/petsc.

[6] E. Barscz, R. Fatoohi, V. Venkatakrishnan, and S. Weeratunga. Solu- 
tion of Regular, Sparse Triangular Systems on Vector and Distributed- 
Memory Multiprocessors. Technical Report NAS RNR-93-007, NASA 
Ames Research Center, April 1993. 

[7] G. Baumgartner, D. E. Bernholdt, D. Cociorva, R. Harrison, S. Hirata,
C. Lam, M. Nooijen, R. Pitzer, J. Ramanujam, and P. Sadayappan. A
High-Level Approach to Synthesis of High-Performance Codes for Quantum
Chemistry. In 2002 ACM/IEEE conference on Supercomputing,
pages 1-10, 2002. Baltimore, Maryland, USA.

[8] D. J. Becker, T. Sterling, D. Savarese, J. E. Dorban, U. A. Ranawak, and
C. V. Packer. Beowulf: A Parallel Workstation for Scientific Computation.
In 1995 International Conference on Parallel Processing, 1995.

[9] G. Bell and J. Gray. What's the Next in High Performance Computing. 
Communications of the ACM, 45(2):91-95, 2002. 

[10] D. E. Bernholdt, J. Nieplocha, and P. Sadayappan. Raising Level
of Programming Abstraction in Scalable Programming Models. In
IEEE International Conference on High Performance Computer Architecture
(HPCA), Workshop on Productivity and Performance in High-End
Computing (P-PHEC), pages 76-84. Madrid, Spain, IEEE Computer
Society, 2004.

[11] M. Bertozzi, G. Chiola, G. Ciaccio, G. Conte, P. Marenzoni, A. Poggi,
and P. Rossi. DISCO Report on the State-of-the-Art of PC Cluster
Computing. Technical Report DISI-TR-98-09, DISI, Università di Genova,
December 1998.

[12] E. Best, J. Esparza, B. Grahlmann, S. Melzer, S. Römer, and F. Wallner.
The PEP Verification System. In Workshop on Formal Design of Safety
Critical Embedded Systems (FEmSys'97), 1997.

[13] J. M. Bishop. Languages for Configuration Programming: A Compari- 
son. Technical Report 94-04, University of Pretoria, 1994. 

[14] S. Breitinger, R. Loogen, Y. Ortega Mallén, and R. Peña. Eden - The
Paradise of Functional Concurrent Programming. Lecture Notes in Computer
Science (EUROPAR'96), 1123:710-713, 1996.

[15] J. F. Briesmeister. MCNP - A General Monte Carlo N-Particle Trans- 
port Code. Technical Report LA-12625-M, Los Alamos National Labo- 
ratory, 1993. 

[16] W. H. Burge. Recursive Programming Techniques. Addison-Wesley
Publishers Ltd., 1975.

[17] G. Burns, R. Daoud, and J. Vaigl. LAM: An Open Cluster Environment 
for MPI. In Proceedings of Supercomputing Symposium, pages 379-386, 
1994. 

[18] F.W. Burton. Functional Programming for Concurrent and Distributed 
Computing. Computer Journal, 30(5):437-450, 1987. 

[19] R. Buyya (ed.). High Performance Cluster Computing: Architectures 
and Systems. Prentice Hall, 1999. 

[20] R. Buyya (ed.). High Performance Cluster Computing: Programming 
and Applications. Prentice Hall, 1999. 






[21] F. H. Carvalho Junior and R. D. Lins. Topological Skeletons in
Haskell#. In International Parallel and Distributed Processing Symposium
(IPDPS). IEEE Press, April 2003. 8 pages.

[22] F. H. Carvalho Junior, R. D. Lins, and R. M. F. Lima. Parallelising
MCP-Haskell# for Evaluating Haskell# Parallel Programming Environment.
In UnB, editor, 13th Brazilian Symposium on Computer Architecture
and High-Performance Computing (SBAC-PAD 2001), September
2001.

[23] F. H. Carvalho Junior, R. D. Lins, and R. M. F. Lima. Translating
Haskell# Programs into Petri Nets. Lecture Notes in Computer Science
(VECPAR'2002), 2565:635-649, 2002.

[24] M. Chakravarty (editor) et al. The Haskell 98 Foreign Function
Interface (FFI) 1.0 (An Addendum to Haskell 98 Report). Glasgow
University, Department of Computing, Functional Programming Research
Group, 2002.

[25] M. Cole. Algorithm Skeletons: Structured Management of Parallel
Computation. Pitman, 1989.

[26] J. Darlington, Y. Guo, H.W. To, and J. Yang. Functional Skeletons for 
Parallel Coordination. Lecture Notes in Computer Science, 966:55-68, 
1995. 

[27] F. DeRemer and H. H. Kron. Programming-in-the-Large versus 
Programming-in-the-small. IEEE Transactions on Software Engineer- 
ing, pages 80-86, June 1976. 

[28] J. Dongarra, I. Foster, G. Fox, W. Gropp, K. Kennedy, L. Torczon,
and A. White. Sourcebook of Parallel Computing. Morgan Kaufmann
Publishers, 2003.

[29] J. Dongarra, S. W. Otto, M. Snir, and D. Walker. An Introduction to the
MPI Standard. Technical Report CS-95-274, University of Tennessee,
January 1995. http://www.netlib.org/tennessee/ut-cs-95-274.ps

[30] C. Dornan, I. Jones, and S. Marlow. Alex User Guide, 2003.
http://www.haskell.org/alex






[31] K. Fall and K. Varadhan. The NS Manual (formerly NS Notes and
Documentation). Technical report, The VINT Project, A Collaboration
between researchers at UC Berkeley, LBL, USC/ISI, and Xerox PARC,
April 2002.

[32] High Performance Fortran Forum. High Performance Fortran, Language 
Specification, Version 2.0, January 1997. 

[33] I. Foster. Compositional Parallel Programming Languages. ACM Trans- 
actions on Programming Languages and Systems, 18(4):454-476, 1985. 

[34] I. Foster and C. Kesselman. The Grid 2: Blueprint for a New Computing
Infrastructure. Morgan Kaufmann, 2004.

[35] D. Gelernter and N. Carriero. Coordination Languages and Their Sig- 
nificance. Communications of the ACM, 35(2):97-107, February 1992. 

[36] R. German. SPNL: Processes as Language-Oriented Building Blocks 
of Stochastic Petri Nets. In 9th Conference on Computer Performance 
Evaluation, Modelling Techniques and Tools, pages 123-134. Springer 
Verlag, 1997. 

[37] W. Gropp, E. Lusk, N. Doss, and A. Skjellum. A High-Performance, 
Portable Implementation of the MPI Message Passing Interface Stan- 
dard. Parallel Computing, 22(6):789-828, 1996. 

[38] M. M. Hamdan. A Combinational Framework for Parallel Programming
Using Skeleton Functions. PhD thesis, Department of Computing and
Electrical Engineering, Heriot-Watt University, January 2000.

[39] J. Hammes, O. Lubeck, and W. Bohm. Comparing Id and Haskell in a 
Monte Carlo Photon Transport Code. Journal of Functional Program- 
ming, pages 283-316, July 1995. 

[40] K. Hammond, J. Berthold, and R. Loogen. Automatic Skeletons in 
Template Haskell. Parallel Processing Letters, 13(3), 2003. 

[41] K. Hammond and G. Michaelson. Research Directions in Parallel
Functional Programming. Springer-Verlag, 1999.

[42] C. Herrman and C. Lengauer. A Higher-Order Language for Dividing
and Conquer. Parallel Processing Letters, 10(2-3):239-250, 2000.






[43] C. A. R. Hoare. Communicating Sequential Processes. Prentice-Hall 
International Series in Computer Science, 1985. 

[44] P. Hudak. Serial Combinators: "Optimal" Grains of Parallelism. In 
FPCA'85, pages 382-399, September 1985. 

[45] P. Hudak. Para-Functional Programming in Haskell. Parallel Functional 
Languages and Compilers, B. K. Szymanski, Ed. ACM Press, New York, 
pages 159-196, 1991. 

[46] Inmos. Occam Programming Manual. Prentice-Hall, C.A.R. Hoare Se- 
ries Editor, 1984. 

[47] T. Ito and Y. Nishitani. On Universality of Concurrent Expressions with 
Synchronization Primitives. Theoretical Computer Science, 19:105-115, 
1982. 

[48] N. Karonis, B. D. Supinski, I. Foster, W. Gropp, E. Lusk, and J.
Bresnahan. Exploiting Hierarchy in Parallel Computer Networks to Optimize
Collective Operation Performance. In 14th International Parallel and
Distributed Processing Symposium, pages 377-384, 2000. Los Alamitos,
CA, USA.

[49] N. Karonis, B. Toonen, and I. Foster. MPICH-G2: A Grid-enabled
Implementation of the Message Passing Interface. Journal of Parallel
and Distributed Computing, 63(5):551-563, 2003.

[50] O. Kaser, C.R. Ramakrishnan, I. V. Ramakrishnan, and R. C. Sekar. 
Equals - A Fast Parallel Implementation of a Lazy Language. Journal 
of Functional Programming, 7(2):183-217, March 1997. 

[51] P. Kelly. Functional Programming for Loosely-coupled Multiproces- 
sors. Research Monographs in Parallel and Distributed Computing, MIT 
Press, 1989. 

[52] G. Kiczales, J. Lamping, A. Mendhekar, C. Maeda, C. Lopes, J.
Loingtier, and J. Irwin. Aspect-Oriented Programming. In Lecture Notes in
Computer Science (Object-Oriented Programming 11th European Conference
- ECOOP '97), volume 1241, pages 220-242. Springer-Verlag,
November 1997.






[53] J. Krammer. Distributed Software Engineering. In IEEE Computer So- 
ciety Press, editor, Proc. 16th IEEE International Conference on Soft- 
ware, 1994. 

[54] M. Lauer. Computing by Homomorphic Images. In G. Buchberger, G. E.
Collins, R. Loos, and R. Albrecht, editors, Computer Algebra - Symbolic
and Algebraic Computation, pages 139-168. Springer, 1982.

[55] R. M. F. Lima. Haskell# - Uma Linguagem Funcional Paralela - Ambiente
de Programação, Implementação e Otimização. PhD thesis, Centro
de Informática, UFPE, July 2000.

[56] R. M. F. Lima and R. D. Lins. Translating HCL Programs into Petri
Nets. In Proceedings of the 14th Brazilian Symposium on Software
Engineering, 2000.

[57] R. D. Lins. Functional programming and parallel processing. In 2nd
International Conference on Vector and Parallel Processing - VECPAR '96
- LNCS 1215 Springer-Verlag, pages 429-457, September 1996.

[58] J. D. Lipson. Chinese Remainder and Interpolation Algorithms. In 
SYMSAM'71 - Symposium on Symbolic and Algebraic Manipulation, 
pages 372-391. Academic Press, 1971. 

[59] H.-W. Loidl, S. Priebe, A. J. Rebon, P. W. Trinder, F. Rubio, N. Scaife,
K. Hammond, S. Horiguchi, L. Loogen, G. J. Michaelson, and R. Peña.
Comparing Parallel Functional Languages: Programming and Performance.
Higher-Order and Symbolic Logic, 16(3):203-251, September
2003.

[60] H. W. Loidl et al. LinSolv: A Case Study in Strategic Parallelism. In
Glasgow Workshop on Functional Programming, pages 15-17, 1997.

[61] S. Lucco and O. Sharp. Delirium: An Embedding Coordination Language.
In Proceedings of the 1990 Conference on Supercomputing, pages
515-524. IEEE Computer Society Press, 1990.

[62] S. Marlow and A. Gill. Happy User Guide, 2001.
http://www.haskell.org/happy.






[63] Message Passing Interface Forum. MPI: A Message-Passing Interface
Standard. International Journal of Supercomputer Applications and
High Performance Computing, 8(3-4):169-416, 1994.

[64] G. Michaelson, N. Scaife, and P. King. Nested Algorithmic Skeletons for 
Higher Order Functions. Parallel Algorithms and Applications, 16:181- 
206, 2001. 

[65] E. E. Miller and R. H. Katz. Input/Output Behavior of Supercomputing
Applications. In Proceedings of Conference Supercomputing'91, pages
567-576, November 1991.

[66] J. Nieplocha, R. J. Harrison, and R. J. Littlefield. Global Arrays:
A Non-Uniform-Memory-Access Programming Model for High-Performance
Computers. The Journal of Supercomputing, 10(2):169-189, 1996.

[67] OpenMP Architecture Review Board. OpenMP: Simple, Portable, Scal- 
able SMP Programming, 1997. 

[68] H. Ossher and P. Tarr. Multi-Dimensional Separation of Concerns
and the Hyperspace Approach. In Proceedings of the Symposium on
Software Architectures and Component Technology: The State of the
Art in Software Development. Kluwer Academics, June 2000. University
of Twente, Enschede, The Netherlands.

[69] C. Pareja, R. Peña, F. Rubio, and C. Segura. Optimizing Eden
by Transformation. In S. Gilmore, editor, Trends in Functional
Programming (2nd Scottish Functional Programming Workshop), volume 1,
chapter 4, pages 39-52. Intellect Books, 2000.

[70] B. K. Pasquale and G. Polyzos. A Static Analysis of I/O Characterization
of Scientific Applications in a Production Workload. In Proceedings of
Conference Supercomputing'93, pages 388-397, November 1993.

[71] J. L. Peterson. Petri Net Theory and Modeling of Systems.
Prentice-Hall, 1981.

[72] C. A. Petri. Kommunikation mit Automaten. Technical Report
RADC-TR-65-377, Griffiths Air Force Base, New York, 1(1), 1966.






[73] S. L. Peyton Jones, C. Clack, and J. Salkild. GRIP - A High-Performance
Architecture for Parallel Graph Reduction. FPCA'87: Conference
on Functional Programming Languages and Computer Architecture
- Springer-Verlag LNCS 274, pages 98-112, 1987.

[74] S. L. Peyton Jones, A. Gordon, and S. Finne. Concurrent Haskell. In
POPL'96 - Symposium on Principles of Programming Languages, ACM
Press, pages 295-308, January 1996.

[75] S. L. Peyton Jones and J. Hughes (editors). Report on the Programming
Language Haskell 98, A Non-strict, Purely Functional Language,
February 1999.

[76] M. J. Plasmeijer and M. van Eekelen. Functional Programming and
Parallel Graph Rewriting. Addison-Wesley Publishers Ltd., 1993.

[77] D. E. Post. The Coming Crisis in Computational Science. In IEEE
International Conference on High Performance Computer Architecture
(HPCA), Workshop on Productivity and Performance in High-End
Computing (P-PHEC). Madrid, Spain, 2004.

[78] M. Quinn. Parallel Computing. McGraw-Hill, 1994. 

[79] F. A. Rabhi and S. Gorlatch. Patterns and Skeletons for Parallel and 
Distributed Computing. Springer, 2002. 

[80] S. Rock and P. Starke. Manual: Integrated Net Analyzer Version 2.2, 
1999. 

[81] P. M. Sansom and S. L. Peyton Jones. Time and Space Profiling for
Non-Strict, Higher-Order Functional Languages. In Proceedings of the
22nd ACM SIGPLAN-SIGACT Symposium on Principles of Programming
Languages, pages 355-366. ACM Press, 1995.

[82] V. Sarkar, C. Williams, and K. Ebcioglu. Application Development
Productivity Challenges for High-End Computing. In IEEE International
Conference on High Performance Computer Architecture (HPCA),
Workshop on Productivity and Performance in High-End Computing,
pages 14-18, 2004.






[83] D. B. Skillicorn and D. Talia. Models and Languages for Parallel
Computation. ACM Computing Surveys, 30:123-169, June 1998.

[84] J. M. Squyres, A. Lumsdaine, W. L. George, J. G. Hagedorn, and J. E.
Devaney. The interoperable message passing interface (IMPI) extensions
to LAM/MPI. In Proceedings, MPIDC'2000, March 2000.

[85] F. Taylor. Parallel Functional Programming by Partitioning. PhD The- 
sis, Department of Computing, Imperial College of Science, Technology 
and Medicine, University of London, January 1997. 

[86] P. Trinder, K. Hammond, J. S. Mattson Junior, A. S. Partridge, and S.
L. Peyton Jones. GUM: A Portable Parallel Implementation of Haskell. In
PLDI'96 - Programming Languages Design and Implementation, pages
19-88, 1996.

[87] P. W. Trinder, H-W. Loidl, and R. F. Pointon. Parallel and Distributed
Haskells. Journal of Functional Programming, 12(4/5):469-510, July
2002.

[88] P. W. Trinder, K. Hammond, and H. W. Loidl. Algorithm + Strategy =
Parallelism. Journal of Functional Programming, 8(1):23-60, January
1998.

[89] R. Valk and G. Vidal Naquet. Petri Nets and Regular Languages. Jour- 
nal of Computer and System Sciences, 23:299-325, 1981. 

[90] P. Wadler. Monads for Functional Programming. Advanced
Functional Programming, LNCS 925, Springer-Verlag, 1995.
http://cm.bell-labs.com/cm/cs/who/wadler/papers/marktoberdorf/marktoberdorf.ps.

[91] M. Weber and E. Kindler. The Petri Net Markup Language. Lecture
Notes in Computer Science, 2002. Submitted for publication in April
2002.

[92] A. Zimmermann, J. Freiheit, R. German, and G. Hommel. Petri Net
Modelling and Performability Evaluation with TimeNET 3.0. In 11th
Int. Conf. on Modelling Techniques and Tools for Computer Performance
Evaluation (TOOLS'2000), pages 188-202. Lecture Notes in Computer
Science, 2000.






A The Formal Syntax of HCL 



In what follows, a context-free grammar is described for HCL, the Haskell#
Configuration Language, whose syntax and programming abstractions were
informally presented in Section 2. Examples of HCL configurations and their
meanings were presented in Sections 2 and 3. The notation employed here
is similar to that used for describing the syntax of Haskell 98 [75]. Indeed,
some non-terminals from that grammar are reused here, since some Haskell code
appears in HCL configurations. They are typeset in bold italic. A minor
difference in notation is the use of (...)?, instead of [...], for describing
optional terms. For simplicity, indexed notation is omitted from the
description of the formal syntax of HCL. It may be resolved by a
pre-processor, before parsing.



A.1 Top-Level Definitions

configuration          → header declaration_1 ... declaration_n   (n > 0)
header                 → component ID static_parameter_list? component_interface?
static_parameter_list  → < ID_1 ... ID_n >   (n > 0)
component_interface    → ports_naming
declaration            → import_decl | use_decl | iterator_decl | interface_decl
                       | unit_decl | assign_decl | replace_decl | channel_decl
                       | unify_decl | factorize_decl | replicate_decl | bind_decl
                       | haskell_code



A.2 Use Declaration

use_decl  → use use_spec
use_spec  → id | id.use_spec | id.{ use_spec_1 , ... , use_spec_n }   (n > 1)



A.3 Import Declaration

import_decl  → impdecl



A.4 Iterator Declaration

iterator_decl  → iterator id_1, ..., id_n range [ numeric_exp , numeric_exp ]   (n > 1)

A.5 Interface Declaration

interface_decl  → interface (context =>)? ID tyvar_1 ... tyvar_k interface_spec
interface_spec  → interface_ports_spec
                  (where : interface_inheritance)? (behavior : behavior_expression)?

A.5.1 Interface Ports Description

interface_ports_spec  → port_spec_list -> port_spec_list
port_spec_list        → port_spec | ( port_spec_1 , ... , port_spec_n )   (n > 2)
port_spec             → id (*)? (:: atype)? | id

A.5.2 Interface Composition

interface_inheritance     → interface_slice_1 # ... # interface_slice_k   (k > 1)
interface_slice           → id @ ID | ID ports_naming_composition
ports_naming_composition  → ports_naming
                          | ( ports_naming_1 # ... # ports_naming_n )   (n > 1)
ports_naming              → port_naming_list -> port_naming_list
port_naming_list          → id | ( id_1 , ... , id_n )   (n > 1)

A.5.3 Interface Behavior

behavior_expression  → (sem id_1 , ... , id_n)? : action   (n > 1)
action               → par { action_1 ; ... ; action_n } | seq { action_1 ; ... ; action_n }
                     | alt { action_1 ... action_n } | repeat action condition?
                     | if condition then action else action
                     | id ! | id ? | signal id | wait id   (n > 2)
condition            → until disjunction | counter numeric_exp
disjunction          → conjunction_1 '|' ... '|' conjunction_n   (n > 1)
sync_conjunction     → ( simple_conjunction ) | simple_conjunction
simple_conjunction   → id | ( id_1 & ... & id_n )   (n > 1)
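
The behavior grammar above maps naturally onto an algebraic data type. The Haskell sketch below is only an illustration under stated assumptions: the type and constructor names (`Action`, `Bang`, `Query`, `activatedPorts`, and so on) are hypothetical, `numeric_exp` is simplified to an integer literal, and a disjunction is represented as a list of conjunctions of identifiers.

```haskell
import Data.List (nub)

-- Hypothetical abstract syntax for the behavior grammar of Section A.5.3.
type PortId = String

data Condition
  = Until [[PortId]]   -- until: a disjunction of conjunctions of identifiers
  | Counter Int        -- counter: numeric_exp simplified to a literal
  deriving (Eq, Show)

data Action
  = Par [Action]                      -- par { a_1 ; ... ; a_n }
  | Seq [Action]                      -- seq { a_1 ; ... ; a_n }
  | Alt [Action]                      -- alt { a_1 ... a_n }
  | Repeat Action (Maybe Condition)   -- repeat action condition?
  | If Condition Action Action        -- if c then a else a
  | Bang PortId                       -- id !
  | Query PortId                      -- id ?
  | Signal String                     -- signal id
  | Wait String                       -- wait id
  deriving (Eq, Show)

-- Ports activated by an action, without duplicates, in order of first use.
activatedPorts :: Action -> [PortId]
activatedPorts = nub . go
  where
    go (Par xs)     = concatMap go xs
    go (Seq xs)     = concatMap go xs
    go (Alt xs)     = concatMap go xs
    go (Repeat a _) = go a
    go (If _ t e)   = go t ++ go e
    go (Bang p)     = [p]
    go (Query p)    = [p]
    go (Signal _)   = []
    go (Wait _)     = []
```

A traversal such as `activatedPorts` is the kind of function a front end could use to check that a behavior expression only mentions ports declared in the interface.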



A.6 Unit Declaration

unit_decl       → unit unit_spec
unit_spec       → (*)? id (# unit_interface)? (wire wf_setup_1 , ... , wf_setup_n)?
unit_interface  → ID ports_naming_composition? | interface_spec
wf_setup        → id (group_type group_spec)? (: wire_function)?
group_spec      → { id_1, ..., id_n } | * numeric_exp
group_type      → any | all
wire_function   → ? | exp



A.7 Assignment Declaration

assign_decl            → assign assigned_component to assigned_unit
assigned_component     → ID actual_parameter_list? ports_naming_composition?
actual_parameter_list  → < numeric_exp_1 , ... , numeric_exp_n >   (n > 1)
assigned_unit          → qid ports_naming_composition?

A.8 Replace Declaration

replace_decl  → replace qid ports_naming_composition? by operand_unit



A.9 Channel Declaration

channel_decl  → connect qid -> qid to qid <- qid , comm_mode
comm_mode     → synchronous | buffered numeric_exp | ready
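
To make the channel production concrete, the following Haskell sketch models a channel declaration and renders it back to concrete syntax. It is an assumption-laden illustration, not part of the Haskell# compiler: the record and function names are hypothetical, `numeric_exp` is simplified to an integer, and the rendering assumes that a unit identifier and a port identifier appear on each side of the `->` and `<-` arrows.

```haskell
import Data.List (intercalate)

-- Qualified identifier: id_1 . ... . id_n
type QId = [String]

-- The three communication modes of the comm_mode production.
data CommMode = Synchronous | Buffered Int | Ready
  deriving (Eq, Show)

-- Hypothetical representation of one channel declaration.
data ChannelDecl = ChannelDecl
  { senderUnit   :: QId
  , senderPort   :: QId
  , receiverUnit :: QId
  , receiverPort :: QId
  , commMode     :: CommMode
  } deriving (Eq, Show)

renderQId :: QId -> String
renderQId = intercalate "."

renderMode :: CommMode -> String
renderMode Synchronous  = "synchronous"
renderMode (Buffered n) = "buffered " ++ show n
renderMode Ready        = "ready"

-- Render a declaration in the concrete syntax assumed above.
renderChannel :: ChannelDecl -> String
renderChannel c = unwords
  [ "connect", renderQId (senderUnit c), "->", renderQId (senderPort c)
  , "to", renderQId (receiverUnit c), "<-", renderQId (receiverPort c) ++ ","
  , renderMode (commMode c)
  ]
```

Keeping the mode as a separate sum type makes it easy to add checks, for example that a buffer size is positive, before any code is generated.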



A.10 Unification Declaration

unify_decl         → unify operand_unit_1 , ... , operand_unit_n to unit_spec
                     adjust wire wf_setup_1 , ... , wf_setup_k   (n > 2, k > 1)
operand_unit       → qid # interface_pattern_1 # ... # interface_pattern_n   (n > 1)
interface_pattern  → port_pattern_list -> port_pattern_list | id
port_pattern_list  → pattern | ( pattern_1 , ... , pattern_n )
pattern            → id | @ qid | - | —



A.11 Factorization Declaration

factorize_decl  → factorize operand_unit to unit_spec_1 ... unit_spec_n
                  adjust wire wf_setup_1 , ... , wf_setup_k   (n > 2, k > 1)
A.12 Replication Declaration

replicate_decl  → replicate operand_unit_1 , ... , operand_unit_n into numeric_exp
                  adjust wire wf_setup_1 , ... , wf_setup_k   (n > 2, k > 1)

A.13 Bind Declaration

bind_decl  → bind qid -> qid to -> id | bind qid <- qid to <- id

A.14 Miscellaneous

haskell_code  → topdecls
qid           → id_1 '.' ... '.' id_n   (n < 2)
qID           → ID_1 '.' ... '.' ID_n   (n < 2)

B An Algebraic Semantics for Haskell# Components

This appendix presents an algebra intended to formalize the semantics of the
Haskell# programming abstractions at the coordination level. A Haskell#
component H may be defined by an algebra with the following elements:

\[ H = \langle G, R, C \rangle \]

where G is a set of generators, R is a set of relations on generators, and
C is a set of restrictions on relations, defined as follows:

\[
G = \left\{
\begin{array}{ll}
C, & \text{composed components} \\
S, & \text{simple components} \\
U, & \text{units} \\
\mathcal{G}, & \text{ports groupings} \\
P, & \text{individual ports} \\
R, & \text{kinds of processes: repetitive or non-repetitive} \\
D, & \text{port directions: input or output} \\
T, & \text{port type: any or all} \\
M, & \text{communication modes: synchronous, buffered or ready}
\end{array}
\right.
\]

\[
R = \left\{
\begin{array}{ll}
\omega : \{*\} \to C \cup S, & \text{main component} \\
\delta : C \to 2^{U}, & \text{units that comprise a component} \\
\psi : P \to P, & \text{association of ports to argument/return points} \\
\gamma : U \to C \cup S, & \text{component associated to a unit} \\
\pi : \mathcal{G} \to U \times D, & \text{unit of a port grouping} \\
\beta : U \to \mathcal{G}^{*}, & \text{behavior of a unit} \\
\tau : \mathcal{G} \to 2^{P} \times T, & \text{grouping of ports} \\
\rho : U \to R, & \text{type of process} \\
\nu : P \times P \times M, & \text{communication channels} \\
\lambda : \mathcal{G} \to \mathit{Nat}, & \text{nesting factor of a stream port} \\
\iota : U \to 2^{\mathcal{G}} \times 2^{\mathcal{G}^{*}}, & \text{interface of a unit}
\end{array}
\right.
\]

\[ C = \{ R_1, R_2, \ldots, R_{12} \} \]

A Haskell# program is a component that may execute. Essentially, it does
not have virtual units in its composition (it is not a partial skeleton). All
units are assigned to a component:
$(\forall u : u \in U : (\exists c : c \in C \cup S : \gamma(u) = c))$.

In what follows, the restrictions $R_1$ to $R_{12}$ are described. They are
formulas in predicate logic of the form $(\forall b_1, b_2, \ldots, b_n : R : P)$
or $(\exists b_1, b_2, \ldots, b_n : R : P)$, where $\forall$ and $\exists$ are
the usual universal and existential quantifiers, $b_i$, $1 \le i \le n$, are
bound variables, R is a formula that specifies the set of values of the bound
variables, and P is a logical predicate.

The restriction $R_1$ states that component $\omega$ (the main component) is
the only component that is not assigned to any unit:

\[ R_1 \vdash \forall u : u \in U : \gamma(u) \neq \omega \tag{1} \]

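$R_2$ states that cyclic dependencies may not occur in the component hierarchy:

\[ R_2 \vdash \forall u : u \in U \wedge \gamma(u) \neq \bot : u \notin (\hat{\delta} \circ \gamma)(u) \tag{2} \]

where:

\[
\begin{array}{ll}
\hat{\delta}(s) = \emptyset, & s \in S \\
\hat{\delta}(c) = \bigcup_{u \in \delta(c)} (\hat{\delta} \circ \gamma)(u), & c \in C
\end{array}
\]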
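
The acyclicity condition stated by R2 can be checked mechanically on a finite description of a component hierarchy. The Haskell sketch below is illustrative only: the names `children`, `descendants` and `satisfiesR2` are hypothetical, γ and δ are encoded as finite maps (an unassigned unit or a simple component is simply absent from the corresponding map), and the traversal computes plain reachability through nesting, which is one reasonable reading of the restriction rather than the authors' exact formulation.

```haskell
import qualified Data.Map as M
import qualified Data.Set as S

type Unit      = String
type Component = String

-- Units directly contained in the component assigned to a unit
-- (empty when the unit is unassigned or its component is simple).
children :: M.Map Unit Component -> M.Map Component [Unit] -> Unit -> [Unit]
children gamma delta u =
  case M.lookup u gamma of
    Nothing -> []
    Just c  -> M.findWithDefault [] c delta

-- All units reachable from u through one or more nesting steps.
-- The set of visited units guarantees termination on cyclic inputs.
descendants :: M.Map Unit Component -> M.Map Component [Unit] -> Unit -> S.Set Unit
descendants gamma delta u = go S.empty (children gamma delta u)
  where
    go seen [] = seen
    go seen (x : xs)
      | x `S.member` seen = go seen xs
      | otherwise         = go (S.insert x seen) (children gamma delta x ++ xs)

-- R2 (as read here): no unit is nested, directly or indirectly, inside itself.
satisfiesR2 :: M.Map Unit Component -> M.Map Component [Unit] -> Bool
satisfiesR2 gamma delta =
  all (\u -> not (u `S.member` descendants gamma delta u)) (M.keys gamma)
```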

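$R_3$ states that a cluster is repetitive whenever all units belonging to its
assigned component are repetitive:

\[
R_3 \vdash \forall u : u \in U \wedge (\exists c : c \in C : \gamma(u) = c \wedge \delta(c) \neq \emptyset) :
\rho(u) = \mathit{Repetitive} \Leftrightarrow (\forall u' : u' \in (\delta \circ \gamma)(u) : \rho(u') = \mathit{Repetitive}) \tag{3}
\]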
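$R_4$ states that groups of ports are disjoint, $R_5$ states that all individual
ports belong to a group of ports, and $R_6$ states that groups of ports must
not be empty:

\[
\begin{array}{l}
R_4 \vdash \forall g, g' : g, g' \in \mathcal{G} : \pi(g) \cap \pi(g') = \emptyset \\
R_5 \vdash \forall p \, \exists g : p \in P \wedge g \in \mathcal{G} : p \in \tau(g) \\
R_6 \vdash \forall g : g \in \mathcal{G} : \tau(g) \neq \emptyset
\end{array}
\tag{4}
\]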

In the algebra, all ports are treated as non-empty groups. Thus, an
individual port in a Haskell# program is represented as a group containing
a unique port. The restrictions above make it possible to define an "inverse"
relation $\tau^{-1}$, such that $\tau^{-1}(p)$ returns the group g to which p
belongs. It is useful for simplifying the next formulations.
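
The three grouping restrictions, and the "inverse" relation they justify, can be sketched over a finite grouping of ports. In the Haskell fragment below everything is an assumption made for illustration: the grouping relation is encoded as a finite map from group names to lists of ports, port types (any or all) are ignored, and the function names are hypothetical.

```haskell
import qualified Data.Map as M
import qualified Data.Set as S

type Port  = String
type Group = String

-- Groups of ports are disjoint: no port is listed more than once overall.
disjointGroups :: M.Map Group [Port] -> Bool
disjointGroups tau = length allPorts == S.size (S.fromList allPorts)
  where allPorts = concat (M.elems tau)

-- Every individual port of the program belongs to some group.
coversAllPorts :: S.Set Port -> M.Map Group [Port] -> Bool
coversAllPorts ports tau =
  ports `S.isSubsetOf` S.fromList (concat (M.elems tau))

-- No group is empty.
nonEmptyGroups :: M.Map Group [Port] -> Bool
nonEmptyGroups = all (not . null) . M.elems

-- The "inverse" relation: the group each port belongs to. It is only
-- well defined (a function) when the groups are disjoint.
inverseTau :: M.Map Group [Port] -> M.Map Port Group
inverseTau tau = M.fromList [ (p, g) | (g, ps) <- M.toList tau, p <- ps ]
```

With disjoint groups, `inverseTau` has exactly one entry per port, which is what allows later formulas to speak of "the" group of a port.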

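Restrictions $R_7$, $R_8$ and $R_9$ specify rules for the formation of channels.
Respectively, they say that channels are point-to-point, unidirectional and
have the same nesting factors:

\[
\begin{array}{l}
R_7 \vdash (p_1^o, p_1^i, m) \in \nu \wedge (p_2^o, p_2^i, m) \in \nu \Rightarrow (p_1^o \neq p_2^o \Leftrightarrow p_1^i \neq p_2^i) \\
R_8 \vdash (p^o, p^i, m) \in \nu \Rightarrow (\exists u, u' : u, u' \in U : (\pi \circ \tau^{-1})(p^o) = (u, \mathit{Output}) \wedge (\pi \circ \tau^{-1})(p^i) = (u', \mathit{Input})) \\
R_9 \vdash (p^o, p^i, m) \in \nu \Rightarrow (\lambda \circ \tau^{-1})(p^o) = (\lambda \circ \tau^{-1})(p^i)
\end{array}
\tag{5}
\]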
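
The point-to-point rule can be tested directly on a finite list of channels. The following Haskell sketch is a minimal illustration, assuming that a channel is represented as a triple of sending port, receiving port and mode; the function name `pointToPoint` is hypothetical.

```haskell
-- A channel set is point-to-point when, for any two channels, the sending
-- ports coincide exactly when the receiving ports coincide. Hence no port
-- takes part in two different channels on the same side.
pointToPoint :: Eq p => [(p, p, m)] -> Bool
pointToPoint chans =
  and [ (o1 == o2) == (i1 == i2)
      | (o1, i1, _) <- chans
      , (o2, i2, _) <- chans
      ]
```

The check is quadratic in the number of channels, which is acceptable for a sketch; an implementation would more likely index channels by port.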

Let u be a cluster ($\gamma(u) \in C$) and p be an individual port belonging to
group g, such that $\pi(g) = (u, d)$, for $d \in D$ (p belongs to the interface
of u). The restriction $R_{10}$ ensures that $\psi(p)$ (argument or exit point of
$\gamma(u)$) is a port with the same direction as p belonging to the interface
of a unit $u'$, such that $u'$ is a unit belonging to the component $\gamma(u)$
($u' \in (\delta \circ \gamma)(u)$):

\[ R_{10} \vdash \forall p : (\pi \circ \tau^{-1})(p) = (u, d) \wedge \gamma(u) \in C : ((\pi \circ \tau^{-1} \circ \psi)(p) = (u', d) \wedge u' \in (\delta \circ \gamma)(u)) \tag{6} \]

The restriction $R_{11}$ defines the relation $\iota$, which describes the
interface of a unit:

\[ R_{11} \vdash \iota(u) = \langle \{ g \mid \pi(g) = (u, d) \}, \rho(u) \rangle \tag{7} \]
''This is not the strict mathematical notion of inverse function, from set theory. 






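$R_{12}$ says that ports belonging to the same group whose communication
pairs also belong to the same group are essentially the same port:

\[
R_{12} \vdash (p_1^o, p_1^i, m_1) \in \nu \wedge (p_2^o, p_2^i, m_2) \in \nu \wedge \tau^{-1}(p_1^o) = \tau^{-1}(p_2^o) \wedge \tau^{-1}(p_1^i) = \tau^{-1}(p_2^i) \Rightarrow p_1^o = p_2^o \wedge p_1^i = p_2^i \tag{8}
\]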

B.1 Formalizing Interfaces

This section formalizes homomorphism relations between interfaces, which 
are essential for formalizing unification and factorization operations in the 
next section. 

B.1.1 The # Operator

The # operator allows for combining two interfaces, generating a new interface
that inherits characteristics from the original ones. It is defined as follows:

\[ I_1 \# I_2 = \langle Q_1 \cup Q_2, B_1 \uplus B_2 \rangle, \quad \text{where } I_1 = \langle Q_1, B_1 \rangle \text{ and } I_2 = \langle Q_2, B_2 \rangle \tag{9} \]

The sets of ports from the operand interfaces may overlap. The operator
$\uplus$ generates a new formal language describing a behavior for interface
$I_1 \# I_2$, which is compatible with the original behaviors of $I_1$ and
$I_2$ taken separately. Given an interleaving operator $\odot$, from concurrent
expressions [47], and a function $\ell$ that returns the language generated by
a concurrent expression, the formal definition of $\uplus$ is:

\[
B_1 \uplus B_2 = \ell \left[ (w_1 \odot u_1) \, s \, (w_2 \odot u_2) \, s \, \ldots \, s \, (w_n \odot u_n) \right], \quad n \ge 1
\tag{10}
\]

where

\[
\begin{array}{l}
s \in Q_1 \cap Q_2 \\
w_1 \, s \, w_2 \, s \ldots s \, w_n \in B_1 \\
u_1 \, s \, u_2 \, s \ldots s \, u_n \in B_2 \\
w_1, w_2, \ldots, w_n \in (Q_1 - \{s\})^{*} \\
u_1, u_2, \ldots, u_n \in (Q_2 - \{s\})^{*}
\end{array}
\]
Ui U2 . . . U„ e {Q2 - {s})* 
If the operand interfaces do not overlap ports, $B_1 \uplus B_2$ corresponds to
the interleaving of their original behaviors ($B_1 \odot B_2$). Overlapping
ports may be interpreted as synchronization points when combining the formal
languages $B_1$ and $B_2$.
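
The idea of interleaving two behaviors while treating shared ports as synchronization points can be sketched for single traces. The Haskell function below is an illustrative assumption, not the definition used in the paper: it works on one word from each behavior instead of whole languages, it accepts any number of shared symbols, and the name `syncShuffle` is hypothetical.

```haskell
-- All interleavings of two traces in which symbols from the shared list
-- must be performed by both traces at the same step, while all other
-- symbols are interleaved freely.
syncShuffle :: Eq a => [a] -> [a] -> [a] -> [[a]]
syncShuffle shared = go
  where
    isShared = (`elem` shared)

    -- When one trace is exhausted, the other may only finish alone if it
    -- needs no further synchronization.
    go [] ys
      | any isShared ys = []
      | otherwise       = [ys]
    go xs []
      | any isShared xs = []
      | otherwise       = [xs]
    go xs@(x : xs') ys@(y : ys') = left ++ right ++ both
      where
        left  = if isShared x then [] else map (x :) (go xs' ys)
        right = if isShared y then [] else map (y :) (go xs ys')
        both  = if isShared x && x == y then map (x :) (go xs' ys') else []
```

For example, `syncShuffle "s" "as" "bs"` yields the two traces `"abs"` and `"bas"`: the private symbols `a` and `b` are interleaved in both orders and the shared symbol `s` occurs once. With an empty shared list the function degenerates to the ordinary interleaving of the two traces.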

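B.1.2 Homomorphisms Between Interfaces

Let $I_1 = \langle Q_1, B_1 \rangle$ and $I_2 = \langle Q_2, B_2 \rangle$ be
interface classes. Let H be a pair
$\langle h : Q_1 \to Q_2, \hat{h} : B_1 \to Q_2^{*} \rangle$, where $\hat{h}$ is
defined as follows:

\[ \hat{h}(a w) = h(a) \, \hat{h}(w) \tag{11} \]

With respect to $H = \langle h, \hat{h} \rangle$, the following interface
relations are defined:

\[
\begin{array}{l}
I_1 \overset{H}{\sqsubseteq} I_2 \Leftrightarrow \mathit{Im}(\hat{h}) \subseteq B_2 \\
I_1 \overset{H}{\sqsupseteq} I_2 \Leftrightarrow \mathit{Im}(\hat{h}) \supseteq B_2 \\
I_1 \overset{H}{=} I_2 \Leftrightarrow \mathit{Im}(\hat{h}) = B_2
\end{array}
\tag{12}
\]

Relations $\sqsubseteq$ and $\sqsupseteq$ characterize homomorphisms between
interfaces, while $=$ characterizes isomorphisms between them.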
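
For finite behaviors, the interface relations above reduce to set comparisons on the image of the port mapping extended to words. The Haskell sketch below makes that reading explicit under clear assumptions: behaviors are finite sets of traces (the paper's behaviors are formal languages and may be infinite), the extension to words is taken to be element-wise application, and all function names are hypothetical.

```haskell
import qualified Data.Set as S

-- Extension of a port mapping to traces: apply it symbol by symbol.
extend :: (a -> b) -> [a] -> [b]
extend = map

-- Image of a finite behavior (set of traces) under the extended mapping.
image :: Ord b => (a -> b) -> S.Set [a] -> S.Set [b]
image h = S.map (extend h)

-- The image of the first behavior is contained in the second behavior.
imageWithin :: Ord b => (a -> b) -> S.Set [a] -> S.Set [b] -> Bool
imageWithin h b1 b2 = image h b1 `S.isSubsetOf` b2

-- The image of the first behavior contains the second behavior.
imageCovers :: Ord b => (a -> b) -> S.Set [a] -> S.Set [b] -> Bool
imageCovers h b1 b2 = b2 `S.isSubsetOf` image h b1

-- The image of the first behavior is exactly the second behavior.
imageEquals :: Ord b => (a -> b) -> S.Set [a] -> S.Set [b] -> Bool
imageEquals h b1 b2 = image h b1 == b2
```

As a usage example, a mapping that merges two ports `a1` and `a2` into a single port `a` sends both traces `[a1, b]` and `[a2, b]` to `[a, b]`, so a behavior containing those two traces is related by equality to the behavior containing only `[a, b]`.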

B.2 An Algebra for Haskell# Programming

Now, an algebra is defined to formalize the Haskell# programming task. The
operations over units are defined here: unification, factorization,
replication and assignment. They may be used to overlap and nest the
components that comprise a Haskell# component. An algebra for Haskell#
programming is defined as:

\[ \langle \{H\}, \{ u : H \times H, \; f : H \times H, \; a : H \times H, \; r : H \times H, \; i : H \times H \}, \{\} \rangle \]

where the generator H contains all well-formed Haskell# components. The
relations u, f, a and r represent sets of pairs $(h_1, h_2)$, $h_1 \in H$ and
$h_2 \in H$, where $h_2$ is a Haskell# component obtained from the Haskell#
component $h_1$ by an application of the unification, factorization,
assignment or replication operations, respectively, defined further on. The
relation i is an identity relation containing the pairs $(h, h)$,
$\forall h \in H$.
In what follows, the assignment, unification, factorization, and replication
operations, which are homomorphisms between Haskell# components, are defined.
Since all Haskell# components may be described using HCL configurations,
which are generated by a context-free grammar, the set H is recursively
enumerable. Thus, in what follows, the i-th Haskell# program, $i > 0$, is
denoted by $\#_i = (G_i, R_i, C_i)$, where
$G_i = \{C_i, S_i, U_i, \mathcal{G}_i, P_i, R_i, D_i, T_i, M_i\}$ and
$R_i = \{\omega_i, \delta_i, \psi_i, \gamma_i, \pi_i, \beta_i, \tau_i, \rho_i, \nu_i, \lambda_i\}$.

B.2.1 Unification and Factorization

Unification and factorization, informally introduced in Section 2.1.10, are
formalized here as mutually reversible relations in the algebra of Haskell#
programming. For instance, consider two Haskell# components and their
algebraic descriptions, denoted by $\#_j$ and $\#_k$, for some $j, k > 0$.
Consider $V = (v_1, v_2, \ldots, v_n)$, an ordered subset of virtual units in
$U_k$, and their respective interfaces $I = (I_1, I_2, \ldots, I_n)$, such that
$I_i = \iota(v_i) = (Q_i, B_i)$, for $1 \le i \le n$. Also, consider a virtual
unit $v \in U_j$ and its interface $I = \iota(v) = (Q, B)$. A set of interface
mappings $H = (H_1, H_2, \ldots, H_n)$ is defined, where
$H_i = (h_i, \hat{h}_i)$ maps interface $I_i$ to interface $I$. Suppose that
$\#_j$ is obtained from $\#_k$ by unification of the virtual units in V into a
unique virtual unit v. It is then also correct to say that $\#_k$ is obtained
from $\#_j$ by factorization of the virtual unit v onto the set of virtual
units V.

Two restrictions must be ensured for a correct application of the unification
and factorization operations. The first imposes behavior-preserving
restrictions on units, stating that $v$ is a proper unification of the virtual units in the
set $V$ if $I_i \sqsupseteq^{H_i} I$ for $1 \leq i \leq n$. Analogously, the units in $V$ constitute a proper
factorization of $v$ if $I_i \sqsupseteq^{H_i} I$ for $1 \leq i \leq n$. The second establishes
restrictions for preserving network connectivity. Before discussing
them, it is necessary to define the relation $f$, which maps each element of $Q$ to a set of ports. It makes it possible to
formalize the partitioning of groups of ports, which must be configured explicitly
in factorizations. In unifications, it is not necessary to configure $f$ explicitly
using HCL, since the inverse of partitioning groups of ports is their union,
which is resolved by merging the groups. Ports b and e in Figure 6 are
examples of partitioning (right to left) and union (left to right) of ports. The
relation $f$ must satisfy the restriction defined in Equation 13, which relates
it to the interface mapping $H$.
\[
\forall q' : q' \in G_j \wedge (\exists q : q \in Q : H(q) = q') \cdot \Big[ \bigcup_{q \in R} f(q) = f(q') \Big],
\quad R = \{ q \mid q \in Q \wedge H(q) = q' \}
\qquad (13)
\]

We now discuss the restrictions that ensure preservation of network
connectivity under unification and factorization. In the trivial
case, where no overlapping of ports occurs ($H(q_1) = H(q_2) \Rightarrow f(q_1) \cap f(q_2) = \emptyset$),
all ports and channels are preserved ($P_j = P_k$ and $\nu_j = \nu_k$)
after applying unification or factorization. Essentially, only the sets of units
($U_j - U_k = \{v\} \wedge U_k - U_j = V$), the ownership of ports (relations $\pi_k$ and $\pi_j$), and
the grouping of ports (relations $\tau_k$ and $\tau_j$) differ between $\#_j$ and $\#_k$. Ownership
and grouping of ports are affected by the interface mappings $H$. If overlapping
of ports occurs, some adjustment of ports and channels may be necessary in
order to obey the restrictions on channel formation. For instance,
consider a port $p$ such that $\exists \hat{Q} : \hat{Q} \subseteq Q : (\forall q : q \in \hat{Q} : p \in f(q)) \wedge |\hat{Q}| \geq 2$.
From the perspective of factorization, $p$ is interpreted as a port of unit $v$,
in component $\#_j$, that has more than one port in $P_k$ associated with it,
possibly all belonging to distinct units in the set $V$ of component $\#_k$. To
ensure the point-to-point nature of channels (R7), the communication pair of
$p$, $\bar{p}$ ($(p, \bar{p}, m) \in \nu_j \vee (\bar{p}, p, m) \in \nu_j$), must be replicated in $|\hat{Q}|$ copies as a
consequence of factorization. The copies are connected to the ports belonging
to the groups in $\hat{Q}$ that are associated with $p$. From the perspective of
unification, $p$ is a port of $\#_j$ that results from the unification of a set of ports
$\hat{Q} = \{p' \mid p' \in P_k \wedge p \in f(p')\}$ of $\#_k$. The communication pairs of the ports in
$\hat{Q}$, denoted $\bar{Q}$, are members of the same group of ports ($\exists g : g \in G_k \cdot \bar{Q} \subseteq \tau_k(g)$).
In such a case, in order to satisfy restriction R12, the ports in $\bar{Q}$ are unified into a
single port $\bar{p}$, the communication pair of $p$.
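The trivial case described above, in which no ports overlap, can be sketched in a few lines. This is only an illustration under strong assumptions: a program is reduced to a set of unit names, a port-ownership map, and a set of channels; groups, interfaces, and behaviors are ignored. The function names `unify` and `factorize` and all sample identifiers are hypothetical.

```python
def unify(units, owner, channels, merged, new_unit):
    """Trivial-case unification: the units in `merged` collapse into
    `new_unit`; ports and channels are preserved, only ownership changes."""
    merged = set(merged)
    if not merged <= units:
        raise ValueError("units to unify must belong to the program")
    new_units = (units - merged) | {new_unit}
    new_owner = {p: (new_unit if u in merged else u) for p, u in owner.items()}
    return new_units, new_owner, set(channels)


def factorize(units, owner, channels, unit, assignment):
    """Inverse operation: split `unit` into the units named in
    `assignment`, a map from each port of `unit` to its new owner."""
    owned = {p for p, u in owner.items() if u == unit}
    if owned != set(assignment):
        raise ValueError("every port of the unit needs exactly one target unit")
    new_units = (units - {unit}) | set(assignment.values())
    new_owner = {p: assignment.get(p, u) for p, u in owner.items()}
    return new_units, new_owner, set(channels)


# Hypothetical program with three units and one channel.
units = {"u1", "u2", "w"}
owner = {"p1": "u1", "p2": "u2", "p3": "w"}
channels = {("p1", "p3")}

unified = unify(units, owner, channels, {"u1", "u2"}, "v")
restored = factorize(*unified, "v", {"p1": "u1", "p2": "u2"})
```

Applying `factorize` to the result of `unify` restores the original program, which reflects the mutually reversible nature of the two relations in this restricted setting.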

B.2.2 Assignment 

In an executable Haskell# program, the application component must not contain
virtual units. Thus, it is necessary to define an operation for associating
components to virtual units (nesting composition). Let $\#_i$ and $\#_j$ be Haskell#
programs, $v \in U_j$ be a virtual unit in program $\#_j$, and $\psi$ a mapping from the
ports of the interface of $v$ to the arguments and exit points of $\omega_i$ (the main component
of $\#_i$).

The assignment of the main component of $\#_i$ ($\omega_i$) to the virtual unit $v$ of $\#_j$
produces a new program $\#_k$, the union of the generators and relations of the two
programs, where $v$ is associated to $\omega_i$ through $\gamma_k$. The arguments and exit points
of $\omega_i$ are associated to the ports of $v$ through $\psi_k$, using $\psi$.

B.2.3 Replication

Let $\#_k$ be a Haskell# program. Given a positive integer $r > 1$ and a collection
of units $U \subseteq U_k$, $U = \{u_1, u_2, \ldots, u_n\}$, it is possible to replicate the
subnetwork induced by the units in $U$ in $r$ copies, forming a new program $\#_j$. In
order to maintain network connectivity and compliance with the Haskell# algebra
restrictions when defining $\#_j$ from $\#_k$, it is necessary to replicate the ports of
units that are not in $U$ but are connected to any port of some unit in $U$. HCL
allows wire functions to be specified for the new groups. Channels connecting
ports of units in $U$ are also replicated in $r$ copies, one connecting each
pair of ports from the $r$ copies of the units.
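The replication rule can be illustrated with a small sketch. It assumes that a program is just a set of unit names plus a set of channels written as `(sender_unit, sender_port, receiver_unit, receiver_port)`; groups and wire functions are not modelled. The function name `replicate` and the naming scheme for copies (`unit#c`, `port[c]`) are hypothetical.

```python
def replicate(units, channels, subset, r):
    """Replicate the subnetwork induced by `subset` in r copies.

    Channels inside the subnetwork are copied once per replica; ports of
    outside units that touch the subnetwork are replicated as port[c]."""
    if r < 2:
        raise ValueError("r must be greater than 1")
    subset = set(subset)
    if not subset <= units:
        raise ValueError("replicated units must belong to the program")

    new_units = (units - subset) | {f"{u}#{c}" for u in subset for c in range(r)}
    new_channels = set()
    for su, sp, tu, tp in channels:
        s_in, t_in = su in subset, tu in subset
        if not s_in and not t_in:
            # Channel unrelated to the replicated subnetwork: keep as is.
            new_channels.add((su, sp, tu, tp))
            continue
        for c in range(r):
            sender = (f"{su}#{c}", sp) if s_in else (su, f"{sp}[{c}]")
            receiver = (f"{tu}#{c}", tp) if t_in else (tu, f"{tp}[{c}]")
            new_channels.add(sender + receiver)
    return new_units, new_channels


# Hypothetical pipeline a -> b -> c, replicating {a, b} twice.
new_units, new_channels = replicate(
    {"a", "b", "c"},
    {("a", "out", "b", "in"), ("b", "out", "c", "in")},
    {"a", "b"},
    2,
)
```

In this example the input port of unit `c`, which lies outside the replicated set, is split into `in[0]` and `in[1]` so that every channel remains point-to-point.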

C HCL Code for NPB Benchmarks EP, IS, CG, and LU

C.1 EP

component EP<NO_NODES, MK, MM, NN, NK, NQ, EPSILON, A, S> with

#define PARAMETERS (EP_Params i NO_NODES MK MM NN NK NQ EPSILON A S)

iterator i range [1..NO_NODES]

use Skeletons.Collective.AllReduce
use EP_FM -- EP Functional Module

interface IEP (sx, sy, q) -> (sx, sy, q)
   where: sx@IAllReduce Double # sy@IAllReduce Double # q@IAllReduce UDVector
   behaviour: seq {do sx; do sy; do q}

unit sx_comm; assign AllReduce<NO_NODES, MPI_SUM, MPI_DOUBLE> to sx_comm
unit sy_comm; assign AllReduce<NO_NODES, MPI_SUM, MPI_DOUBLE> to sy_comm
unit q_comm ; assign AllReduce<NO_NODES, MPI_SUM, MPI_DOUBLE> to q_comm

[/ unify sx_comm.p[i] # sx, sy_comm.p[i] # sy, q_comm.p[i] # q to ep_unit[i] # IEP

   assign EP_FM (PARAMETERS, sx, sy, q) -> (sx, sy, q) to ep_unit[i] # sx # sy # q /]
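The three AllReduce skeletons in the EP configuration are instantiated with MPI_SUM, so each process contributes a partial result and every process receives the combined sum. The following sketch simulates that effect sequentially in plain Python; it does not use MPI, and the function name `all_reduce_sum` and the sample values are hypothetical.

```python
def all_reduce_sum(vectors):
    """Simulate an all-reduce with a sum operator: every participant
    contributes one vector and receives the element-wise total."""
    if not vectors:
        return []
    length = len(vectors[0])
    if any(len(v) != length for v in vectors):
        raise ValueError("all contributions must have the same length")
    total = [sum(column) for column in zip(*vectors)]
    # Each participant gets its own copy of the reduced vector.
    return [list(total) for _ in vectors]


# Three hypothetical processes, each holding a partial vector.
partials = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
reduced = all_reduce_sum(partials)
```

Scalar values such as `sx` and `sy` correspond to vectors of length one in this sketch, while `q` corresponds to a longer vector.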

C.2 IS 

component IS<PROBLEM_CLASS, NUM_PROCS, MAX_KEY_LOG2, NUM_BUCKETS_LOG2,
             TOTAL_KEYS_LOG2, MAX_ITERATIONS, MAX_PROCS, TEST_ARRAY_SIZE> with

#define PARAMETERS (IS_Params PROBLEM_CLASS NUM_PROCS MAX_KEY_LOG2 NUM_BUCKETS_LOG2
                    TOTAL_KEYS_LOG2 MAX_ITERATIONS MAX_PROCS TEST_ARRAY_SIZE)

iterator i range [1..NUM_PROCS]



use Skeletons.{Misc.RShift, Collective.{AllReduce, AllToAllv}}
use IS_FM -- IS Functional Module

interface IIS (bs*, kb*, k) -> (bs*, kb*, k)
   where: bs@IAllReduce (UArray Int Int) # kb@IAllToAllv (Int, Ptr Int) # k@RShift Int
   behaviour: seq {repeat seq {do bs; do kb} until <bs & kb>; do k}

unit bs_comm ; assign AllReduce<NUM_PROCS, MPI_SUM, MPI_INTEGER> to bs_comm
unit kb_comm ; assign AllToAllv<NUM_PROCS> to kb_comm
unit k_shift ; assign RShift<NUM_PROCS> -> _ to k_shift

[/ unify bs_comm.p[i] # bs, kb_comm.p[i] # kb, k_shift.p[i] # k to is_unit[i] # IIS

   assign IS_FM (PARAMETERS, bs, kb, k) -> (bs, kb, k) to is_unit[i] # bs # kb # k /]

C.3 The CG Kernel

C.3.1 The Transpose Skeleton

component Transpose<DIM, COL_FACTOR>

iterator i, j range [1..DIM]
iterator k range [1..COL_FACTOR]

interface ITranspose (x::UDVector) -> (w::UDVector) behaviour: seq { w!; x? }

[/ unit trans[i][j] # ITranspose wire x all*DIM:?, w all*DIM:? /]
[/ connect trans[i][j] -> w[k] to trans[k][i] <- x[j] /]

[/ factorize trans[i][j] # w -> x to [/ u[(i-1)*COL_FACTOR+k][j] # w -> x /]
   adjust wires w: sum_arrays, x: split_and_scatter /]



C.3.2 The CG Component

component CG<DIM, COL_FACTOR, NA, NONZER, SHIFT, NITER, RCOND, ZVV> # () -> (zeta, x) with
#define PARAMETERS (CG_Params DIM (DIM*COL_FACTOR) NA NONZER SHIFT NITER RCOND ZVV)

use Skeletons.MPI.Collective.AllReduce
use Transpose
use CG_FM -- CG Functional Module

index i range [1..DIM]
index j range [1..COL_FACTOR]

interface ICG (r*, q**, rho**, aux**, rnorm*, norm_temp_1*, norm_temp_2*)
           -> (r*, q**, rho**, aux**, rnorm*, norm_temp_1*, norm_temp_2*, x::Array Int Double, zeta::Double)
   where: q@ITranspose # rho@IAllReduce Double # aux@IAllReduce Double # rnorm@IAllReduce Double #
          r@ITranspose # norm_temp_1@IAllReduce Double # norm_temp_2@IAllReduce Double #
   behaviour: repeat seq {do rho; repeat seq {do q; do aux; if rho then do rho else skip }
                                  until <q & aux & rho>;
                          do r; do rnorm; do norm_temp_1; do norm_temp_2;}
              until <r & rnorm & q & aux & rho & norm_temp_1 & norm_temp_2>

unit q_comm; assign Transpose<DIM, DIM * DIM * COL_FACTOR> to q_comm
unit r_comm; assign Transpose<DIM, DIM * DIM * COL_FACTOR> to r_comm

[/ unit rho_comm[i];         assign AllReduce<DIM * COL_FACTOR, MPI_SUM, MPI_DOUBLE> to rho_comm[i]
   unit aux_comm[i];         assign AllReduce<DIM * COL_FACTOR, MPI_SUM, MPI_DOUBLE> to aux_comm[i]
   unit rnorm_comm[i];       assign AllReduce<DIM * COL_FACTOR, MPI_SUM, MPI_DOUBLE> to rnorm_comm[i]
   unit norm_temp_1_comm[i]; assign AllReduce<DIM * COL_FACTOR, MPI_SUM, MPI_DOUBLE> to norm_temp_1_comm[i]
   unit norm_temp_2_comm[i]; assign AllReduce<DIM * COL_FACTOR, MPI_SUM, MPI_DOUBLE> to norm_temp_2_comm[i] /]

[/ unify q_comm.u[i][j] # q, rho_comm[i].p[j] # rho, aux_comm[i].p[j] # aux,
         r_comm.u[i][j] # r, rnorm_comm[i].p[j] # rnorm,
         norm_temp_1_comm[i].p[j] # norm_temp_1, norm_temp_2_comm[i].p[j] # norm_temp_2
      to cg[i][j] # ICG

   assign CG_FM (PARAMETERS, q, rho, r, aux, rnorm, norm_temp_1, norm_temp_2)
              -> (q, rho, aux, r, rnorm, norm_temp_1, norm_temp_2)
       to cg[i][j] # q # rho # aux # r # rnorm # norm_temp_1 # norm_temp_2
/]
C.4 The LU Simulated Application

C.4.1 The Exchange_1b Skeleton

component Exchange_1b <XDIV, YDIV, ITMAX> with

iterator m range [0..(YDIV-1)]
iterator n range [0..(XDIV-1)]

interface IExchange_1b # (from_north**, from_west**, from_south**, from_east** :: UArray (Int,Int) Double)
                      -> (to_south**, to_east**, to_north**, to_west** :: UArray (Int,Int) Double)
   behaviour: repeat { seq {repeat seq{from_north?;
                                       from_west?;
                                       to_south!;
                                       to_east!} until <from_north & from_west &
                                                        to_south & to_east>;
                            repeat seq{from_south?;
                                       from_east?;
                                       to_north!;
                                       to_west!} until <from_south & from_east &
                                                        to_north & to_west>
                           }
                     } until ITMAX

[/ unit bigLoop[n][m] # IExchange_1b /]

[/ connect bigLoop[n][m] -> to_south to bigLoop[(n+1) mod XDIV][m] <- from_north
   connect bigLoop[n][m] -> to_east  to bigLoop[n][(m+1) mod YDIV] <- from_west
   connect bigLoop[n][m] -> to_north to bigLoop[(n+XDIV-1) mod XDIV][m] <- from_south
   connect bigLoop[n][m] -> to_west  to bigLoop[n][(m+YDIV-1) mod YDIV] <- from_east /]
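The four connect clauses of Exchange_1b wire the units into a wrap-around grid (a torus), using modular arithmetic on the unit indices. The sketch below computes, for a unit at position (n, m), the unit that receives each of its outgoing ports. The function name `torus_neighbours` is hypothetical; the index formulas follow the connect clauses.

```python
def torus_neighbours(n, m, xdiv, ydiv):
    """Return the receiving unit of each outgoing port of unit (n, m)
    on an xdiv-by-ydiv wrap-around grid."""
    return {
        "to_south": ((n + 1) % xdiv, m),          # received as from_north
        "to_east": (n, (m + 1) % ydiv),           # received as from_west
        "to_north": ((n + xdiv - 1) % xdiv, m),   # received as from_south
        "to_west": (n, (m + ydiv - 1) % ydiv),    # received as from_east
    }


# Unit (0, 0) on a 4-by-2 grid: the north and west links wrap around.
links = torus_neighbours(0, 0, 4, 2)
```

On a 4-by-2 grid, unit (0, 0) sends south to (1, 0) and north to (3, 0); with only two columns, its east and west neighbours coincide at (0, 1).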



C.4.2 The Exchange_3b Skeleton

component Exchange_3b <XDIV, YDIV> with

iterator m range [0..(YDIV-1)]
iterator n range [0..(XDIV-1)]

interface IExchange_3b # (from_north*, from_south*, from_east*, from_west* :: UArray Int Double)
                      -> (to_north*, to_south*, to_east*, to_west* :: UArray Int Double)
   behaviour: repeat seq {to_south!; from_north?; to_north!; from_south?;
                          to_east!; from_west?; to_west!; from_east?} until to_south

[/ unit g1[n][m] # IExchange_3b /]

[/ connect g1[n][m] -> g1_ts to g1[(n+1) mod XDIV][m] <- g1_fn
   connect g1[n][m] -> g1_tn to g1[(n+XDIV-1) mod XDIV][m] <- g1_fs
   connect g1[n][m] -> g1_te to g1[n][(m+1) mod YDIV] <- g1_fw
   connect g1[n][m] -> g1_tw to g1[n][(m+YDIV-1) mod YDIV] <- g1_fe /]



C.4.3 The Exchange_4 Skeleton

component Exchange_4 <XDIV, YDIV> with

iterator n range [0..(YDIV-2)]
iterator s range [1..(YDIV-1)]
iterator l range [0..(XDIV-2)]
iterator r range [1..(XDIV-1)]

iterator i range [1..(YDIV-2)]
iterator j range [1..(XDIV-2)]

interface IExchange_4

interface IExchange_4_Null specializes IExchange_4

interface IExchange_4_Border # (in::UArray Int Double) -> (out::UArray Int Double)
   behaviour: seq {out!; in?} specializes IExchange_4

interface IExchange_4_Corner_NW # (in1, in2::UArray Int Double) -> ()
   behaviour: seq {in1?; in2?} specializes IExchange_4

interface IExchange_4_Corner_SE # () -> (out1, out2::UArray Int Double)
   behaviour: seq {out1!; out2!} specializes IExchange_4

[/ unit h0[i][j] # IExchange_4_Null /]

unit h0[0][0] # IExchange_4_Corner_NW
unit h0[XDIV-1][YDIV-1] # IExchange_4_Corner_SE

[/ unit h0[n][0] # IExchange_4_Border /]
[/ unit h0[s][XDIV-1] # IExchange_4_Border /]
[/ unit h0[l][0] # IExchange_4_Border /]
[/ unit h0[r][YDIV-1] # IExchange_4_Border /]

[/ connect h0[0][s] -> out to h0[0][s-1] <- in /]
[/ connect h0[XDIV-1][s] -> out to h0[XDIV-1][s-1] <- in /]
[/ connect h0[r][0] -> out to h0[r-1][0] <- in /]
[/ connect h0[r][YDIV-1] -> out to h0[r-1][YDIV-1] <- in /]



C.4.4 The Exchange_5 Skeleton

component Exchange_5 <XDIV, YDIV> with

iterator m range [0..(YDIV-1)]
iterator n range [0..(XDIV-1)]
iterator i range [1..(YDIV-2)]
iterator j range [1..(XDIV-2)]

interface generalization IExchange_5

interface IExchange_5_Null specializes IExchange_5

interface IExchange_5_Top # (in::UArray Int Double) -> () behaviour: in? specializes IExchange_5
interface IExchange_5_Bottom # () -> (out::UArray Int Double) behaviour: out! specializes IExchange_5
interface IExchange_5_Side # (in::UArray Int Double) -> (out::UArray Int Double)
   behaviour: seq {out!; in?} specializes IExchange_5

[/ unit h1[i][m] # IExchange_5_Null /]

unit h1[0][0] # IExchange_5_Top
unit h1[0][YDIV-1] # IExchange_5_Top
unit h1[XDIV-1][0] # IExchange_5_Bottom
unit h1[XDIV-1][YDIV-1] # IExchange_5_Bottom

[/ unit h1[j][0] # IExchange_5_Side
   unit h1[j][YDIV-1] # IExchange_5_Side /]

[/ connect h1[l][0] -> out to h1[l-1][0] <- in /]
[/ connect h1[l][YDIV-1] -> out to h1[l-1][YDIV-1] <- in /]



C.4.5 The Exchange_6 Skeleton

component Exchange_6 <XDIV, YDIV> with

iterator m range [0..(YDIV-1)]
iterator n range [0..(XDIV-1)]

iterator i range [1..(YDIV-2)]
iterator j range [1..(XDIV-2)]

interface generalization IExchange_6

interface IExchange_6_Null specializes IExchange_6

interface IExchange_6_Left # (in::UArray Int Double) -> () behaviour: in? specializes IExchange_6
interface IExchange_6_Right # () -> (out::UArray Int Double) behaviour: out! specializes IExchange_6
interface IExchange_6_Side # (in::UArray Int Double) -> (out::UArray Int Double)
   behaviour: seq {out!; in?} specializes IExchange_6

[/ unit h2[i][m] # IExchange_6_Null /]

unit h2[0][0] # IExchange_6_Left
unit h2[0][YDIV-1] # IExchange_6_Left
unit h2[XDIV-1][0] # IExchange_6_Right
unit h2[XDIV-1][YDIV-1] # IExchange_6_Right

[/ unit h2[j][0] # IExchange_6_Side
   unit h2[j][YDIV-1] # IExchange_6_Side /]

[/ connect h2[0][l] -> out to h2[0][l-1] <- in /]
[/ connect h2[XDIV-1][l] -> out to h2[XDIV-1][l-1] <- in /]



C.4.6 The LU Component (Application Skeleton)

component LU <NPROCS, PROBLEM_SIZE, DT_DEFAULT, ITMAX> with

#define D ilog2(NPROCS)/2
#define XDIV (ipow2(if (D*2 == ilog2(NPROCS), D, D + 1)))
#define YDIV (ipow2(D))
#define PARAMETERS (LU_Params NPROCS PROBLEM_SIZE DT_DEFAULT ITMAX)

use Skeletons.MPI.{AllReduce, BCast}
use Exchange_1b, Exchange_3b, Exchange_4, Exchange_5, Exchange_6
use LU_FM -- LU Functional Module

iterator m range [1..YDIV]
iterator n range [1..XDIV]

interface ILU (ipr, inorm, itmax, nx0, ny0, nz0, dt, omega, tolrsd, rsdnm*, errnm, frc1, frc2, frc3, rsd1, rsd0, u1, phis, phiver, phihor)
           -> (ipr, inorm, itmax, nx0, ny0, nz0, dt, omega, tolrsd, rsdnm*, errnm, frc1, frc2, frc3, rsd1, rsd0, u1, phis, phiver, phihor)
   where: ipr, inorm, itmax, nx0, ny0, nz0 @ IBCast Int #
          dt, omega @ IBCast Double #
          tolrsd @ IBCast MyArray1d #
          rsdnm, errnm @ IAllReduce MyArray1d #
          frc1, frc2, frc3 @ IAllReduce Double #
          rsd1 @ IExchange_1b #
          rsd0, u1 @ IExchange_3b #
          phis @ IExchange_4 #
          phiver @ IExchange_5 #
          phihor @ IExchange_6 #
   behaviour: seq { do ipr; do inorm; do itmax; do nx0; do ny0; do nz0; do dt; do omega; do tolrsd; do rsd0;
                    do u1; do rsdnm; do rsd1; do u1; do rsdnm; do errnm; do phis; do frc1; do phiver; do frc2;
                    do phihor; do frc3 }



unit ipr_comm    ; assign BCast <XDIV * YDIV> to ipr_comm
unit inorm_comm  ; assign BCast <XDIV * YDIV> to inorm_comm
unit itmax_comm  ; assign BCast <XDIV * YDIV> to itmax_comm
unit nx0_comm    ; assign BCast <XDIV * YDIV> to nx0_comm
unit ny0_comm    ; assign BCast <XDIV * YDIV> to ny0_comm
unit nz0_comm    ; assign BCast <XDIV * YDIV> to nz0_comm
unit dt_comm     ; assign BCast <XDIV * YDIV> to dt_comm
unit omega_comm  ; assign BCast <XDIV * YDIV> to omega_comm
unit tolrsd_comm ; assign BCast <XDIV * YDIV> to tolrsd_comm
unit rsd0_comm   ; assign Exchange_3b <XDIV, YDIV> to rsd0_comm
unit u1_comm     ; assign Exchange_3b <XDIV, YDIV> to u1_comm
unit rsdnm_comm  ; assign AllReduce <XDIV * YDIV, MPI_DOUBLE, MPI_SUM> to rsdnm_comm
unit ssor_comm   ; assign Exchange_1b <XDIV, YDIV, ITMAX, NZ> to ssor_comm
unit errnm_comm  ; assign AllReduce <XDIV * YDIV, MPI_DOUBLE, MPI_SUM> to errnm_comm
unit phis_comm   ; assign Exchange_4 <XDIV, YDIV> to phis_comm
unit frc1_comm   ; assign AllReduce <XDIV * YDIV, MPI_DOUBLE, MPI_SUM> to frc1_comm
unit phiver_comm ; assign Exchange_5 <XDIV, YDIV> to phiver_comm
unit frc2_comm   ; assign AllReduce <XDIV * YDIV, MPI_DOUBLE, MPI_SUM> to frc2_comm
unit phihor_comm ; assign Exchange_6 <XDIV, YDIV> to phihor_comm
unit frc3_comm   ; assign AllReduce <XDIV * YDIV, MPI_DOUBLE, MPI_SUM> to frc3_comm



[/ unify ipr_comm.p[n][m]        # ipr,
         inorm_comm.p[n][m]      # inorm,
         itmax_comm.p[n][m]      # itmax,
         dt_comm.p[n][m]         # dt,
         omega_comm.p[n][m]      # omega,
         tolrsd_comm.p[n][m]     # tolrsd,
         nx0_comm.p[n][m]        # nx0,
         ny0_comm.p[n][m]        # ny0,
         nz0_comm.p[n][m]        # nz0,
         rsd0_comm.g1[n][m]      # rsd0,
         u1_comm.g1[n][m]        # u1,
         rsdnm_comm.p[n][m]      # rsdnm,
         ssor_comm.bigLoop[n][m] # rsd1,
         errnm_comm.p[n][m]      # errnm,
         phis_comm.h0[n][m]      # phis,
         frc1_comm.p[n][m]       # frc1,
         phiver_comm.h1[n][m]    # phiver,
         frc2_comm.p[n][m]       # frc2,
         phihor_comm.h2[n][m]    # phihor,
         frc3_comm.p[n][m]       # frc3 to lu[n][m] # ILU

   assign LU_FM (PARAMETERS, ipr, inorm, itmax, nx0, ny0, nz0, dt, omega, tolrsd, rsdnm, errnm,
                 frc1, frc2, frc3, rsd1, rsd0, u1, phis, phiver, phihor)
              -> (ipr, inorm, itmax, nx0, ny0, nz0, dt, omega, tolrsd, rsdnm, errnm,
                  frc1, frc2, frc3, rsd1, rsd0, u1, phis, phiver, phihor)
       to lu[n][m] # ipr # inorm # itmax # nx0 # ny0 # nz0 # dt # omega # tolrsd # rsdnm # errnm
                   # frc1 # frc2 # frc3 # rsd1 # rsd0 # u1 # phis # phiver # phihor /]
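The XDIV and YDIV macros of the LU component derive a two-dimensional process grid from the number of processes. The sketch below reproduces that arithmetic under several assumptions that the listing does not state: `ilog2` is taken to be the floor of the base-2 logarithm, `ipow2(x)` is taken to be 2 to the power x, the division in D is an integer division, and the condition inside the `if` compares for equality. The function names `ilog2` and `grid_dims` are illustrative.

```python
def ilog2(n):
    """Floor of the base-2 logarithm of a positive integer (assumed semantics)."""
    if n < 1:
        raise ValueError("n must be positive")
    return n.bit_length() - 1


def grid_dims(nprocs):
    """Return (XDIV, YDIV) for a given number of processes."""
    d = ilog2(nprocs) // 2
    # An even exponent yields a square grid; an odd one doubles XDIV.
    xdiv = 2 ** (d if d * 2 == ilog2(nprocs) else d + 1)
    ydiv = 2 ** d
    return xdiv, ydiv


dims_16 = grid_dims(16)
dims_8 = grid_dims(8)
```

Under these assumptions, 16 processes give a 4-by-4 grid and 8 processes give a 4-by-2 grid, so XDIV * YDIV equals the process count whenever that count is a power of two.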